Rate Distortion and Denoising of Individual 
Data Using Kolmogorov Complexity 

Nikolai K. Vereshchagin and Paul M.B. Vitanyi 
Abstract 

We examine the structure of families of distortion balls from the perspective of Kolmogorov complexity. Special 
attention is paid to the canonical rate-distortion function of a source word which returns the minimal Kolmogorov 
complexity of all distortion balls containing that word subject to a bound on their cardinality. This canonical rate- 
distortion function is related to the more standard algorithmic rate-distortion function for the given distortion measure. 
Examples are given of list distortion, Hamming distortion, and Euclidean distortion. The algorithmic rate-distortion 
function can behave differently from Shannon's rate-distortion function. To this end, we show that the canonical rate- 
distortion function can and does assume a wide class of shapes (unlike Shannon's); we relate low algorithmic mutual 
information to low Kolmogorov complexity (and consequently suggest that certain aspects of the mutual information 
formulation of Shannon's rate-distortion function behave differently than would an analogous formulation using 
algorithmic mutual information); we explore the notion that low Kolmogorov complexity distortion balls containing 
a given word capture the interesting properties of that word (which is hard to formalize in Shannon's theory) and 
this suggests an approach to denoising; and, finally, we show that the different behavior of the rate-distortion curves 
of individual source words to some extent disappears after averaging over the source words. 

I. Introduction 

Rate distortion theory analyzes the transmission and storage of information at insufficient bit rates. The aim is 
to minimize the resulting information loss expressed in a given distortion measure. The original data is called the 
'source word' and the encoding used for transmission or storage is called the 'destination word.' The number of bits 
available for a destination word is called the 'rate.' The choice of distortion measure is usually a selection of which 
aspects of the source word are relevant in the setting at hand, and which aspects are irrelevant (such as noise). 
For example, in lossy compression of a sound file this results in a compressed file where, among 
other things, the very high and very low inaudible frequencies have been suppressed. The distortion measure is chosen 
such that it penalizes the deletion of the inaudible frequencies only lightly, because they are not relevant for the 
auditory experience. We study rate distortion of individual source words using Kolmogorov complexity and show 
how it is related to denoising. The classical probabilistic theory is reviewed in Appendix A. Computability notions 
are reviewed in Appendix B and Kolmogorov complexity in Appendix C. Randomness deficiency according to 
Definition 8 and its relation to the fitness of a destination word for a source word are explained further in Appendix D. 
Appendix E gives the proof, required for a Hamming distortion example, that every large Hamming ball can be 
covered by a small number of smaller Hamming balls (each of equal cardinality). More specifically, the number of 
covering balls is close to the ratio between the cardinality of the large Hamming ball and the small Hamming ball. 
The proofs of the theorems are deferred to Appendix F. 

A. Related Work 

In [8] A.N. Kolmogorov formulated the 'structure function' which can be viewed as a proposal for non- 
probabilistic model selection. This function and the associated Kolmogorov sufficient statistics are partially treated 
in [19], [24], [6] and analyzed in detail in [22]. We will show that the structure function approach can be generalized 
to give an approach to rate distortion and denoising of individual data. 

Classical rate-distortion theory was initiated by Shannon in [17]. In [18] Shannon gave a nonconstructive 
asymptotic characterization of the expected rate-distortion curve of a random variable (Theorem 5 in Appendix A). 
References [1], [2] treat more general distortion measures and random variables in the Shannon framework. 

References [25], [13], [20] relate the classical and algorithmic approaches according to traditional information- 
theoretic concerns. We follow their definitions of the rate-distortion function. The results show that if the source 
word is obtained from random i.i.d. sources, then with high probability and in expectation its individual rate-distortion 
curve is close to Shannon's single rate-distortion curve. In contrast, our Theorem 1 shows that for 
distortion measures satisfying properties 1 through 4 below there are many different shapes of individual rate- 
distortion functions related to the different individual source words, and many of them are very different from 
Shannon's rate-distortion curve. 

Ziv [26] also considers a rate-distortion function for individual data. The rate-distortion function is assigned to 
every infinite sequence $\omega$ of letters of a finite alphabet $\Gamma$. The source words $x$ are prefixes of $\omega$ and the encoding 
function is computed by a finite state transducer. Kolmogorov complexity is not involved. 

In [16], [12], [4], [5] alternative approaches to denoising via compression and in [15], [14] applications of the 
current work are given. 

In [22] Theorems 1, 3 were obtained for a particular distortion measure relevant to model selection (the example 
$\mathcal{L}$ in this paper). The techniques used in that paper do not generalize to prove the current theorems, which concern 
arbitrary distortion measures satisfying certain properties given below. 

B. Results 

A source word is taken to be a finite binary string. Destination words are finite objects (not necessarily finite 
binary strings). For every destination word encoding a particular source word with a certain distortion, there is a 
finite set of source words that are encoded by this destination word with at most that distortion. We call these finite 
sets of source words 'distortion balls.' Our approach is based on the Kolmogorov complexity of distortion balls. 
For every source word we define its 'canonical' rate-distortion function, from which the algorithmic rate-distortion 
function of that source word can be obtained by a simple transformation (Lemma 2). 

Below we assume that a distortion measure satisfies certain properties which are specified in the theorems 
concerned. In Theorem 1 it is shown that there are distinct canonical rate-distortion curves (and hence distinct 
rate-distortion curves) associated with distinct source words (although some curves may coincide). Moreover, every 
candidate curve from a given family of curves is realized approximately as the canonical rate-distortion curve (and 
hence for a related family of curves every curve is realized approximately as the rate-distortion curve) of some 
source word. In Theorem 2 we prove a Kolmogorov complexity analogue for Shannon's theorem, Theorem 5 in 
Appendix A, on the characterization of the expected rate-distortion curve of a random variable. The new theorem 
states approximately the following: For every source word and every destination word there exists another destination 
word that has Kolmogorov complexity equal to the algorithmic information in the first destination word about the source 
word, up to a logarithmic additive term, and both destination words incur the same distortion with the source word. 
(The theorem is given in the distortion-ball formulation of destination words.) In Theorem 3 we show that, at every 
rate, the destination word incurring the least distortion is in fact the 'best-fitting' among all destination words at 
that rate. 'Best-fitting' is taken in the sense of sharing the most properties with the source word. (This notion of 
a 'best-fitting' destination word for a source word can be expressed in Kolmogorov complexity, but not in the 
classic probabilistic framework. Hence there is no classical analogue for this theorem.) It turns out that this yields 
a method of denoising by compression. Finally, in Theorem 4, we show that the expectation of the algorithmic 
rate-distortion functions is pointwise related to Shannon's rate-distortion function, where the closeness depends on 
the Kolmogorov complexities involved and ergodicity and stationarity of the source. 

II. Preliminaries 

A. Data and Binary Strings 

We write string to mean a finite binary string. Other finite objects can be encoded into strings in natural ways. 
The set of strings is denoted by $\{0,1\}^*$. The length of a string $x$ is the number of bits in it, denoted by $|x|$. The 
empty string $\epsilon$ has length $|\epsilon| = 0$. Identify the natural numbers $\mathcal{N}$ (including 0) and $\{0,1\}^*$ according to the 
correspondence 

$$(0,\epsilon),\ (1,0),\ (2,1),\ (3,00),\ (4,01),\ \ldots \tag{1}$$

Then, $|010| = 3$. The emphasis is on binary sequences only for convenience; observations in every finite alphabet 
can be so encoded in a way that is 'theory neutral'. For example, if a finite alphabet $\Sigma$ has cardinality $2^k$, then 
every element $i \in \Sigma$ can be encoded by $\sigma(i)$, which is a block of bits of length $k$. With this encoding every $x \in \Sigma^*$ 
satisfies $C(x) = C(\sigma(x))$ up to an additive constant that is independent of $x$ (see Appendix C for basic definitions 
and results on Kolmogorov complexity). 
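
As an illustration of the correspondence (1), the following minimal Python sketch maps natural numbers to strings and back. The function names `nat_to_string` and `string_to_nat` are hypothetical and do not come from the paper; the sketch relies on the observation that the string paired with $n$ is the binary expansion of $n + 1$ with its leading 1 removed.

```python
def nat_to_string(n: int) -> str:
    """Return the binary string paired with the natural number n in (1)."""
    if n < 0:
        raise ValueError("n must be a natural number (0 included)")
    # bin(n + 1) looks like '0b1...'; dropping '0b' and the leading '1'
    # leaves exactly the string paired with n.
    return bin(n + 1)[3:]


def string_to_nat(s: str) -> int:
    """Inverse of nat_to_string: prepend a '1', read as binary, subtract 1."""
    if any(ch not in "01" for ch in s):
        raise ValueError("s must be a binary string")
    return int("1" + s, 2) - 1


if __name__ == "__main__":
    for n in range(8):
        print(n, repr(nat_to_string(n)))
```

Under this pairing the strings are ordered first by length and then lexicographically, which is the computable total order that later definitions can take as fixed.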

B. Rate-Distortion Vocabulary 

Let $X$ be a set, called the source alphabet, whose elements are called source words or messages. We also use 
a set $\mathcal{Y}$, called the destination alphabet, whose elements are called destination words. (The destination alphabet 
is also called the reproduction alphabet.) In general there are no restrictions on the set $X$; it can be countable or 
uncountable. However, for technical reasons, we assume $X = \{0,1\}^*$. On the other hand, it is important that the 
set $\mathcal{Y}$ consists of finite objects: we need the Kolmogorov complexity $C(y)$ to be defined for all $y \in \mathcal{Y}$. 
(Again, for basic definitions and results on Kolmogorov complexity see Appendix C.) In this paper it is not essential 
whether we use plain Kolmogorov complexity or the prefix variant; we use plain Kolmogorov complexity. 

Suppose we want to communicate a source word $x \in X$ using a destination word $y \in \mathcal{Y}$ that can be encoded 
in at most $r$ bits in the sense that the Kolmogorov complexity $C(y) \le r$. Assume furthermore that we are given 
a distortion function $d : X \times \mathcal{Y} \to \mathcal{R}^+ \cup \{\infty\}$ that measures the fidelity of the destination word against the source 
word. Here $\mathcal{R}^+$ denotes the nonnegative real numbers. 

DEFINITION 1: Let $x \in X = \{0,1\}^*$ and let $\mathcal{Q}$ denote the rational numbers. The rate-distortion function $r_x : \mathcal{Q} \to \mathcal{N}$ 
gives the minimum number of bits in a destination word $y$ needed to obtain a distortion of at most $\delta$, and is defined by 

$$r_x(\delta) = \min_{y \in \mathcal{Y}} \{C(y) : d(x,y) \le \delta\}.$$

The 'inverse' of the above function is the distortion-rate function $d_x : \mathcal{N} \to \mathcal{R}^+$, defined by 

$$d_x(r) = \min_{y \in \mathcal{Y}} \{d(x,y) : C(y) \le r\}.$$

These functions are analogs for individual source words $x$ of Shannon's rate-distortion function defined in (8) 
and its related distortion-rate function, expressing the least expected rate or distortion at which outcomes from a 
random source $X$ can be transmitted, see Appendix A. 

C. Canonical Rate-Distortion Function 

Let $X = \{0,1\}^*$ be the source alphabet, $\mathcal{Y}$ a destination alphabet, and $d$ a distortion measure. 

Definition 2: A distortion ball $B(y,\delta)$ centered on $y \in \mathcal{Y}$ with radius $\delta \in \mathcal{Q}$ is defined by 

$$B(y,\delta) = \{x \in X : d(x,y) \le \delta\},$$

and its cardinality is denoted by $b(y,\delta) = |B(y,\delta)|$. (We will consider only pairs $(\mathcal{Y}, d)$ such that all distortion 
balls are finite.) If the cardinality $b(y,\delta)$ depends only on $\delta$ but not on the center $y$, then we denote it by $b(\delta)$. The 
family $\mathcal{A}^{d,\mathcal{Y}}$ is defined as the set of all nonempty distortion balls. The restriction to strings of length $n$ is denoted 
by $\mathcal{A}_n^{d,\mathcal{Y}}$. 

To define the canonical rate-distortion function we need the notion of the Kolmogorov complexity of a finite set. 

Definition 3: Fix a computable total order on the set of all strings (say the order defined in (1)). The Kolmogorov 
complexity $C(A)$ of a finite set $A$ is defined as the length of the shortest string $p$ such that the universal reference 
Turing machine $U$, given $p$ as input, prints the list of all elements of $A$ in the fixed order and halts. We require 
that the constituent elements are distinguishable so that we can tell them apart. Similarly we define the conditional 
versions $C(A \mid z)$ and $C(z \mid A)$ where $A$ is a finite set of strings and $z$ is a string or a finite set of strings. 

REMARK 1: In Definition 3 it is important that U(p) halts after printing the last element in the list — in this way 
we know that the list is complete. If we allowed U(p) to not halt, then we would obtain the complexity of the 
so-called implicit description of A, which can be much smaller than C(A). 

REMARK 2: We can allow $U(p)$ to output the list of elements in any order in Definition 3. This flexibility 
decreases $C(A)$ by at most a constant not depending on $A$ but only depending on the order in (1). The same 
applies to $C(A \mid z)$. On the other hand, if $A$ occurs in a conditional, such as in $C(z \mid A)$, then it is important that 
the elements of $A$ are given in the fixed order, since the order in which the elements of $A$ are listed 
can provide extra information. 

Definition 4: Fix a computable bijection $\phi$ from the family of all finite subsets of $\{0,1\}^*$ to $\{0,1\}^*$. Let $\mathcal{A}$ be 
a finite family of finite subsets of $X = \{0,1\}^*$. Define the Kolmogorov complexity $C(\mathcal{A})$ by 
$C(\mathcal{A}) = C(\{\phi(A) : A \in \mathcal{A}\})$. 

Remark 3: An equivalent definition of $C(A \mid z)$ and $C(z \mid A)$ as in Definition 3 is as follows. Let $\phi$ be as in 
Definition 4. Then we can define $C(A \mid z)$ by $C(\phi(A) \mid z)$ and $C(z \mid A)$ by $C(z \mid \phi(A))$. 

DEFINITION 5: For every string $x$ the canonical rate-distortion function $g_x : \mathcal{N} \to \mathcal{N}$ is defined by 

$$g_x(l) = \min_{B \in \mathcal{A}^{d,\mathcal{Y}}} \{C(B) : x \in B,\ \log |B| \le l\}.$$

In a similar way we can define the canonical distortion-rate function: 

$$h_x(j) = \min_{B \in \mathcal{A}^{d,\mathcal{Y}}} \{\log |B| : x \in B,\ C(B) \le j\}.$$

Definition 6: A distortion family $\mathcal{A}$ is a set of finite nonempty subsets of the set of source words $X = \{0,1\}^*$. 
The restriction to source words of length $n$ is denoted by $\mathcal{A}_n$. 

Every destination alphabet $\mathcal{Y}$ and distortion measure $d$ gives rise to a set of distortion balls $\mathcal{A}^{d,\mathcal{Y}}$, which is 
a distortion family. Thus the class of distortion families includes every family of distortion balls (or 
distortion spheres, which is sometimes more convenient) arising from every combination of destination set and 
distortion measure. It is easy to see that we can also substitute the more general distortion families $\mathcal{A}$ for $\mathcal{A}^{d,\mathcal{Y}}$ in 
the definitions of the canonical rate-distortion and distortion-rate functions. 

In general, the canonical rate-distortion function of $x$ can be quite different from the rate-distortion function of 
$x$. However, by Lemma 2 below it turns out that for every distortion measure satisfying certain conditions and for 
every $x$ the rate-distortion function $r_x$ is obtained from $g_x$ by a simple transformation requiring the cardinality of 
the distortion balls. 

REMARK 4: Fix a string $x \in X = \{0,1\}^*$ and consider different distortion families $\mathcal{A}$. Let $g_x^{\mathcal{A}}$ denote the 
canonical rate-distortion function of $x$ with respect to a family $\mathcal{A}$. Obviously, if $\mathcal{A} \subset \mathcal{B}$ then $g_x^{\mathcal{A}}$ is pointwise 
not less than $g_x^{\mathcal{B}}$ (and it may happen that $g_x^{\mathcal{A}}(i) \gg g_x^{\mathcal{B}}(i)$ for some $i$). But as long as $\mathcal{A}$ satisfies certain natural 
properties, the set of all possible $g_x^{\mathcal{A}}$, when $x$ ranges over $X$, does not depend on the particular $\mathcal{A}$ involved, 
see Theorem 1. 

D. Use of the Big O Term 

In the sequel we use 'additive constant $c$' or equivalently 'additive $O(1)$ term' to mean a constant, accounting for 
the length of a fixed binary program, independent of every variable or parameter in the expression in which it 
occurs. Similarly we use '$O(f(m,n,\ldots))$' to mean a function $g(m,n,\ldots)$ such that $g(m,n,\ldots) \le c f(m,n,\ldots) + c$, 
where $c$ is a fixed constant independent of every variable $m, n, \ldots$ in the expression. 

III. Distortion Measures 

Since every family of distortion balls is a distortion family, considering arbitrary distortion measures and destination 
alphabets results in distortion families. We consider the following mild conditions on distortion families $\mathcal{A}$: 

Property 1. For every natural number $n$, the family $\mathcal{A}$ contains the set $\{0,1\}^n$ of all strings of length $n$ 
as an element. 

Property 2. All $x, y \in A \in \mathcal{A}$ satisfy $|x| = |y|$. 

Property 3. Recall that $\mathcal{A}_n = \{A \in \mathcal{A} : A \subseteq \{0,1\}^n\}$. Then, $C(\mathcal{A}_n) = O(\log n)$. 

Property 4. For every natural $n$, let $\alpha_n$ denote the minimal number that satisfies the following. For every 
positive integer $c$ every set $A \in \mathcal{A}_n$ can be covered by at most $\alpha_n |A|/c$ sets $B \in \mathcal{A}$ with $|B| \le c$. Call 
$\alpha_n$ the covering coefficient related to $\mathcal{A}_n$. Property 4 is satisfied if $\alpha_n$ is bounded by a polynomial in $n$. 
The smaller the covering coefficient is, the more accurate will be the description that we obtain of the 
shapes of the structure functions below. 

The following three example families $\mathcal{A}$ satisfy all four properties. 

EXAMPLE 1: $\mathcal{L}$, the list distortion family. Let $\mathcal{L}_n$ be the family of all nonempty subsets of $\{0,1\}^n$. This is the 
family of distortion balls for list distortion, which we define as follows. Let $X = \{0,1\}^*$ and $\mathcal{Y} = \bigcup_n \mathcal{L}_n$. A 
source word $x \in \{0,1\}^n$ is encoded by a destination word which is a subset or list $S \subseteq \{0,1\}^n$ with $x \in S$. Given 
$S$, we can retrieve $x$ by its index of $\log |S|$ bits in $S$, ignoring rounding up, whence the name 'list code.' The 
distortion measure is $d(x,S) = \log |S|$ if $x \in S$, and $\infty$ otherwise. Thus, distortion balls come only in the form 
$B(S, \log |S|)$ with cardinality $b(S, \log |S|) = |S|$. Trivially, the covering coefficient as defined in property 4, for 
the list distortion family $\mathcal{L}$, satisfies $\alpha_n \le 2$. Reference [22] describes all possible canonical distortion-rate curves, 
called Kolmogorov's structure function there and first defined in [8]. The distortion-rate function for list distortion 
coincides with the canonical distortion-rate function. The rate-distortion function of $x$ for list distortion is 

$$r_x(\delta) = \min_{S \subseteq \{0,1\}^n} \{C(S) : x \in S,\ \log |S| \le \delta\}$$

and essentially coincides with the canonical rate-distortion function ($g_x$ is the restriction of $r_x$ to $\mathcal{N}$). 
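
The covering bound of property 4 for the list distortion family can be checked directly on small cases. The sketch below is illustrative only: the helper name `cover_by_chunks` is hypothetical, and it assumes $1 \le c \le |A|$. Under that assumption, splitting $A$ into consecutive chunks of $c$ elements uses $\lceil |A|/c \rceil \le 2|A|/c$ sets, each of which is a nonempty set of strings of one length and hence a member of the family.

```python
def cover_by_chunks(A, c):
    """Cover the finite set A by subsets of cardinality at most c.

    Assumption (not from the paper): 1 <= c <= len(A). The elements are
    sorted and cut into consecutive chunks of c elements each.
    """
    if not 1 <= c <= len(A):
        raise ValueError("this sketch assumes 1 <= c <= |A|")
    elems = sorted(A)
    return [set(elems[i:i + c]) for i in range(0, len(elems), c)]


if __name__ == "__main__":
    # Eleven 4-bit strings, covered by lists of at most 3 elements each.
    A = {format(i, "04b") for i in range(11)}
    cover = cover_by_chunks(A, 3)
    print(len(cover), "sets; bound 2|A|/c =", 2 * len(A) / 3)
```

The chunking is only one of many valid coverings; it is chosen here because the count $\lceil |A|/c \rceil$ is easy to compare with $2|A|/c$.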
EXAMPLE 2: $\mathcal{H}$, the Hamming distortion family. Let $X = \mathcal{Y} = \{0,1\}^*$. A source word $x \in \{0,1\}^n$ is encoded 
by a destination word $y \in \{0,1\}^n$. For every positive integer $n$, the Hamming distance between two strings 
$x = x_1 \ldots x_n$ and $y = y_1 \ldots y_n$ is defined by 

$$d(x,y) = \frac{1}{n} |\{i : x_i \ne y_i\}|. \tag{2}$$

If $x$ and $y$ have different lengths, then $d(x,y) = \infty$. A Hamming ball in $\{0,1\}^n$ with center $y \in \{0,1\}^n$ and radius 
$\delta$ ($0 \le \delta \le 1$) is the set $B(y,\delta) = \{x \in \{0,1\}^n : d(x,y) \le \delta\}$. Every $x$ is in either $B(00\ldots0, \frac{1}{2})$ or $B(11\ldots1, \frac{1}{2})$, 
so we need to consider only Hamming distance $0 \le \delta \le \frac{1}{2}$. Let $\mathcal{H}_n$ be the family of all Hamming balls in $\{0,1\}^n$. 
We will use the following approximation of $b(\delta)$, the cardinality of Hamming balls in $\mathcal{H}_n$ of radius $\delta$. Suppose 
that $0 < \delta \le \frac{1}{2}$ and $\delta n$ is an integer, and let $H(\delta) = \delta \log 1/\delta + (1 - \delta) \log 1/(1 - \delta)$ be Shannon's binary entropy 
function. Then, 

$$2^{nH(\delta) - \frac{\log n}{2} - O(1)} \le b(\delta) \le 2^{nH(\delta)}. \tag{3}$$

In Appendix E it is shown that the covering coefficient as defined in property 4, for the Hamming distortion family 
$\mathcal{H}_n$, satisfies $\alpha_n = n^{O(1)}$. The function 

$$r_x(\delta) = \min_{y \in \{0,1\}^n} \{C(y) : d(x,y) \le \delta\}$$

is the rate-distortion function of $x$ for Hamming distortion. An approximation to one such function is depicted in 
Figure 1. $\diamond$
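
The entropy bounds on the cardinality of Hamming balls can be checked numerically for moderate $n$. The following is a minimal sketch, not part of the paper: the function names are hypothetical, the radius is passed as the integer $k = \delta n$ to avoid rounding issues, and the constant 1 used for the $O(1)$ term in the usage check is an illustrative assumption.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon's binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


def hamming_distortion(x: str, y: str) -> float:
    """Normalized Hamming distance; infinite if the lengths differ."""
    if len(x) != len(y):
        return math.inf
    if not x:
        return 0.0
    return sum(a != b for a, b in zip(x, y)) / len(x)


def hamming_ball_size(n: int, k: int) -> int:
    """Number of strings of length n within k bit flips of a fixed center."""
    return sum(math.comb(n, i) for i in range(k + 1))


if __name__ == "__main__":
    n = 60
    for k in (1, 10, 20, 30):
        log_b = math.log2(hamming_ball_size(n, k))
        upper = n * binary_entropy(k / n)
        lower = upper - math.log2(n) / 2 - 1  # illustrative constant
        print(k, round(lower, 2), round(log_b, 2), round(upper, 2))
```

The cardinality does not depend on the center, which is why `hamming_ball_size` takes only $n$ and $k$.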
EXAMPLE 3: $\mathcal{E}$, the Euclidean distortion family. Let $\mathcal{E}_n$ be the family of all intervals in $\{0,1\}^n$, where an interval 
is a subset of $\{0,1\}^n$ of the form $\{x : a \le x \le b\}$ and $\le$ denotes the lexicographic ordering on $\{0,1\}^n$. Let 
$\mathcal{Y} = \{0,1\}^*$. A source word $x \in \{0,1\}^n$ is encoded by a destination word $y \in \{0,1\}^n$. Interpret strings in $\{0,1\}^n$ 
as binary notations for rational numbers in the segment $[0,1]$. Consider the Euclidean distance $|x - y|$ between 
rational numbers $x$ and $y$. The balls in this metric are intervals; the cardinality of a ball of radius $\delta$ is about $\delta 2^n$. 
Trivially, the covering coefficient as defined in property 4, for the Euclidean distortion family $\mathcal{E}_n$, satisfies $\alpha_n \le 2$. 
The function 

$$r_x(\delta) = \min_{y \in \{0,1\}^n} \{C(y) : |x - y| \le \delta\}$$

is the rate-distortion function of $x$ for Euclidean distortion. 

All the properties 1 through 4 are straightforward for all three families, except property 4 in the case of the family 
of Hamming balls. 

IV. Shapes 

The rate-distortion functions of the individual strings of length $n$ can assume roughly every shape, that is, every 
shape derivable from a function in the large family $G_n$ of Definition 7 below through transformation (4). 

We start the formal part of this section. Let $\mathcal{A}$ be a distortion family satisfying properties 1 through 4. 

Property 1 implies that $\{0,1\}^n \in \mathcal{A}$ and property 4 applied to $\{0,1\}^n$ and $c = 1$, for every $n$, implies trivially 
that the family $\mathcal{A}$ contains the singleton set $\{x\}$ for every $x \in \{0,1\}^*$. Hence, 

$$g_x(0) = C(\{x\}) = C(x) + O(1).$$

Property 1 implies that for every $n$ and string $x$ of length $n$, 

$$g_x(n) \le C(\{0,1\}^n) = C(n) + O(1) \le \log n + O(1).$$

Together this means that for every $n$ and every string $x$ of length $n$, the function $g_x(l)$ decreases from about $C(x)$ 
to about 0 as $l$ increases from 0 to $n$. 

Lemma 1: Let $\mathcal{A}$ be a distortion family satisfying properties 1 through 4. For every $n$ and every string $x$ of 
length $n$ we have $g_x(n) = O(\log n)$, and $0 \le g_x(l) - g_x(m) \le m - l + O(\log n)$ for all $l < m \le n$. 

Proof: The first equation and the left-hand inequality of the second equation are straightforward. To prove 
the right-hand inequality let $A$ witness $g_x(m) = k$, which implies that $C(A) = k$ and $\log |A| \le m$. By Property 4 
there is a covering of $A$ by at most $\alpha_n |A|/2^l$ sets in $\mathcal{A}_n$ of cardinality at most $2^l$ each. Given a list of $A$ and a list 
of $\mathcal{A}_n$, we can find such a covering. Let $B$ be one of the covering sets containing $x$. Then, $B$ can be specified by 
$A$, $n$, $l$, $\mathcal{A}_n$ and the index $i$ of $B$ among the covering sets. Since $\alpha_n$ is polynomial in $n$, the index $i$ takes at most 
$\log(\alpha_n |A|/2^l) \le m - l + O(\log n)$ bits. We need also $O(\log k + \log\log i + \log\log l + \log\log n)$ 
extra bits to separate the descriptions of $A$ and $\mathcal{A}_n$, and the binary representations of $i$, $n$, $l$, from one another. 
Without loss of generality we can assume that $k$ is less than $n$. Thus all the extra information and separator bits 
are included in $O(\log n)$ bits. Altogether, $C(B) \le C(A) + m - l + O(\log n) \le k + m - l + O(\log n)$, which shows 
that $g_x(l) \le k + m - l + O(\log n) = g_x(m) + m - l + O(\log n)$. ■ 

Example 4: Lemma 1 shows that 

$$C(x) - i - O(\log n) \le g_x(i) \le n - i + O(\log n),$$

for every $0 \le i \le n$. The right-hand inequality is obtained by setting $m = n$, $l = i$ in the lemma, yielding 

$$g_x(i) = g_x(i) - g_x(n) + O(\log n) \le n - i + O(\log n).$$

The left-hand inequality is obtained by setting $l = 0$, $m = i$ in the lemma, yielding 

$$C(x) - g_x(i) = g_x(0) - g_x(i) + O(1) \le i - 0 + O(\log n).$$

The last displayed equation can also be shown by a simple direct argument: $x$ can be described by the minimal 
description of the set $A \in \mathcal{A}$ witnessing $g_x(i)$ and by the ordinal number of $x$ in $A$. $\diamond$

The rate-distortion function $r_x$ differs from $g_x$ by just a change of scale depending on the distortion family 
involved, provided certain computational requirements are fulfilled. See Appendix B for computability notions. 

Lemma 2: Let $X = \{0,1\}^*$, $\mathcal{Y}$, and $d$ be the source alphabet, destination alphabet, and distortion measure, 
respectively. Assume that the set $\{(x,y,\delta) \in X \times \mathcal{Y} \times \mathcal{Q} : d(x,y) \le \delta\}$ is decidable; that $\mathcal{Y}$ is recursively 
enumerable; that for every $n$ the cardinality of every ball in $\mathcal{A}_n^{d,\mathcal{Y}}$ of radius $\delta$ is at most $b_n(\delta)$ and at least 
$b_n(\delta)/\beta(n)$, where $\beta(n)$ is polynomial in $n$ and $b_n(\delta)$ is a function of $n, \delta$; and that the distortion family $\mathcal{A}^{d,\mathcal{Y}}$ 
satisfies properties 1 through 4. Then, for every $x \in \{0,1\}^n$ and every rational $\delta$ we have 

$$r_x(\delta) = g_x(\lceil \log b_n(\delta) \rceil) + O(C(\delta) + \log n). \tag{4}$$

Proof: Fix $n$ and a string $x$ of length $n$. Consider the auxiliary function 

$$\tilde{r}_x(\delta) = \min_{y \in \mathcal{Y}} \{C(B(y,\delta)) : d(x,y) \le \delta\}. \tag{5}$$

We claim that $\tilde{r}_x(\delta) = r_x(\delta) + O(C(\delta) + \log n)$. Indeed, let $y$ witness $r_x(\delta) = k$. Given $y, \delta, n$ we can compute a list 
of elements of the ball $B(y,\delta)$: for all strings $x'$ of length $n$ determine whether $d(x',y) \le \delta$. Thus $C(B(y,\delta)) \le 
k + O(C(\delta) + \log n)$, hence $\tilde{r}_x(\delta) \le k + O(C(\delta) + \log n)$. Conversely, let $B(y,\delta)$ witness $\tilde{r}_x(\delta) = k$. Given a list of 
the elements of $B(y,\delta)$ and $\delta$ we can recursively enumerate $\mathcal{Y}$ to find the first element $y'$ with $B(y',\delta) = B(y,\delta)$ (for 
every enumerated $y'$ compute the list $B(y',\delta)$ and compare it to the given list $B(y,\delta)$). Then, $C(y') \le k + O(C(\delta))$ 
and $d(x,y') \le \delta$. Hence $r_x(\delta) \le k + O(C(\delta))$. 
Thus, it suffices to show that 

$$\tilde{r}_x(\delta) = g_x(\lceil \log b_n(\delta) \rceil) + O(\log n).$$

($g_x(\lceil \log b_n(\delta) \rceil) \le \tilde{r}_x(\delta)$) Assume $\tilde{r}_x(\delta) = k$ is witnessed by a distortion ball $B(y,\delta)$. By our assumption, the 
cardinality of $B(y,\delta)$ is at most $b_n(\delta)$, and hence $g_x(\lceil \log b_n(\delta) \rceil) \le k$. 

($\tilde{r}_x(\delta) \le g_x(\lceil \log b_n(\delta) \rceil) + O(\log n)$) By Lemma 1, $g_x(l)$ and $g_x(l - m)$ differ by at most $m + O(\log n)$. 
Therefore it suffices to show that $\tilde{r}_x(\delta) \le g_x(\lceil \log b_n(\delta) \rceil - m)$ for some $m = O(\log n)$. We claim that this 
happens for $m = \lceil \log \beta(n) \rceil + 1$. Indeed, let $g_x(\lceil \log b_n(\delta) \rceil - m) = k$ be witnessed by a distortion ball $B$. Then, 
$|B| \le 2^{\lceil \log b_n(\delta) \rceil}/(2\beta(n)) < b_n(\delta)/\beta(n)$. This implies that the radius of $B$ is less than $\delta$ and hence $B$ witnesses 
$\tilde{r}_x(\delta) \le k$. ■ 

REMARK 5: When measuring distortion we usually do not need rational numbers with numerator or denominator 
more than $n = |x|$. Then, the term $O(C(\delta))$ in (4) is absorbed by the term $O(\log n)$. Thus, describing the family of 
$g_x$'s we obtain an approximate description of all possible rate-distortion functions $r_x$ for given destination alphabet 
and distortion measure, satisfying the computability conditions, by using the transformation (4). An example of an 
approximate rate-distortion curve $r_x$ for some string $x$ of length $n$ for Hamming distortion is given in Figure 1. 

REMARK 6: The computability properties of the functions $r_x$, $d_x$, and $g_x$, as well as the relation between the 
destination word for a source word and the related distortion ball, are explained in Appendix B. 

We present an approximate description of the family of possible $g_x$'s below. It turns out that the description does 
not depend on the particular distortion family $\mathcal{A}$ as long as properties 1 through 4 are satisfied. 

Definition 7: Let $G_n$ stand for the class of all functions $g : \{0,1,\ldots,n\} \to \mathcal{N}$ such that $g(n) = 0$ and 
$g(l-1) \in \{g(l), g(l)+1\}$ for all $1 \le l \le n$. 

In other words, a function $g$ is in $G_n$ iff it is nonincreasing and the function $g(i) + i$ is nondecreasing and 
$g(n) = 0$. The following result is a generalization to arbitrary distortion measures of Theorem IV.4 in [22] dealing 
with $h_x$ (equaling $d_x$ in the particular case of the distortion family $\mathcal{L}$). There, the precision in Item (ii) for source 
words of length $n$ is $O(\log n)$, rather than the $O(\sqrt{n \log n})$ we obtain for general distortion families. 

Theorem 1: Let $\mathcal{A}$ be a distortion family satisfying properties 1 through 4. 

(i) For every $n$ and every string $x$ of length $n$, the function $g_x(l)$ is equal to $g(l) + O(\log n)$ for some function 
$g \in G_n$. 

(ii) Conversely, for every $n$ and every function $g$ in $G_n$, there is a string $x$ of length $n$ such that for every 
$l = 0, \ldots, n$, $g_x(l) = g(l) + O(\sqrt{n \log n})$. 

REMARK 7: For fixed $k \le n$ the number of different integer functions $g \in G_n$ with $g(0) = k$ is $\binom{n}{k}$. For $k = \frac{1}{2}n$ 
this number is of order $2^n/\sqrt{n}$, and therefore far greater than the number of strings $x$ of length $n$ and Kolmogorov 
complexity $C(x) = k = \frac{1}{2}n$, which is at most $2^{n/2}$. This explains the fact that in Theorem 1, Item (ii), we cannot 
precisely match a string $x$ of length $n$ to every function $g \in G_n$, and therefore have to use approximate shapes. 
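
For small $n$ the class $G_n$ can be enumerated exhaustively, which makes both the 'in other words' characterization and the binomial count easy to check. The sketch below is illustrative: the names `in_G`, `in_G_alt` and `enumerate_G` are hypothetical, and a function $g$ is represented by the tuple $(g(0), \ldots, g(n))$.

```python
from itertools import product
from math import comb


def in_G(g) -> bool:
    """Membership in G_n: g(n) = 0 and g(l-1) - g(l) is 0 or 1 for 1 <= l <= n."""
    n = len(g) - 1
    return g[n] == 0 and all(g[l - 1] - g[l] in (0, 1) for l in range(1, n + 1))


def in_G_alt(g) -> bool:
    """Equivalent form for integer g: g nonincreasing, g(i) + i nondecreasing, g(n) = 0."""
    n = len(g) - 1
    nonincreasing = all(g[l - 1] >= g[l] for l in range(1, n + 1))
    shifted_nondecreasing = all(g[l - 1] + (l - 1) <= g[l] + l for l in range(1, n + 1))
    return g[n] == 0 and nonincreasing and shifted_nondecreasing


def enumerate_G(n: int):
    """Yield every function in G_n as a tuple (g(0), ..., g(n))."""
    for steps in product((0, 1), repeat=n):
        g = [0] * (n + 1)
        # steps[l - 1] is the drop g(l-1) - g(l); build g from right to left.
        for l in range(n, 0, -1):
            g[l - 1] = g[l] + steps[l - 1]
        yield tuple(g)


if __name__ == "__main__":
    n = 8
    for k in range(n + 1):
        count = sum(1 for g in enumerate_G(n) if g[0] == k)
        print(k, count, comb(n, k))
```

Each function is determined by the positions at which it drops by one, so fixing $g(0) = k$ amounts to choosing $k$ of the $n$ steps.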
EXAMPLE 5: By Theorem 1, Item (ii), for every $g \in G_n$ there is a string $x$ of length $n$ that has $g$ for its canonical 
rate-distortion function $g_x$ up to an additive $O(\sqrt{n \log n})$ term. By (3), (4), and Remark 5, 

$$r_x(\delta) = g_x(nH(\delta)) + O(\log n),$$

for $0 \le \delta \le \frac{1}{2}$. Figure 1 gives the graph of a particular function $r(\delta) = g(nH(\delta))$ with $g$ defined as follows: 
$g(l) = n(1 + H(\frac{1}{6}) - H(\frac{1}{3})) - l$ for $0 \le l \le nH(\frac{1}{6})$, $g(l) = n(1 - H(\frac{1}{3}))$ for $nH(\frac{1}{6}) < l \le nH(\frac{1}{3})$, and 
$g(l) = n - l$ for $nH(\frac{1}{3}) < l \le n$. In this way, $g \in G_n$. Thus, there is a string $x$ of length $n$ with its rate-distortion 
graph $r_x(\delta)$ in a strip of size $O(\sqrt{n \log n})$ around the graph of $r(\delta)$. Note that $r_x$ is almost constant on the segment 
$[\frac{1}{6}; \frac{1}{3}]$. Allowing the distortion to increase on this interval, all the way from $\frac{1}{6}$ to $\frac{1}{3}$, so allowing $n/6$ incorrect extra 
bits, we still cannot significantly decrease the rate. This means that the distortion-rate function $d_x(r)$ of $x$ drops 
from $\frac{1}{3}$ to $\frac{1}{6}$ near the point $r = n(1 - H(\frac{1}{3}))$, exhibiting a very unsmooth behavior. $\diamond$

Fig. 1. An approximate rate-distortion function for Hamming distortion. Horizontal axis: distortion $\delta = d(x,y)$, with marks at $1/6$, $1/3$, $1/2$. 
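
The plateau of the curve in Figure 1 can be reproduced numerically. The sketch below is an illustration under stated assumptions rather than the authors' construction: it takes $g$ to fall with slope $-1$ from $n(1 + H(\frac{1}{6}) - H(\frac{1}{3}))$ to the value $n(1 - H(\frac{1}{3}))$, stay constant for $nH(\frac{1}{6}) < l \le nH(\frac{1}{3})$, and then follow $n - l$. It evaluates $g$ at real arguments without rounding to integers, and the names `example_g` and `example_r` are hypothetical.

```python
import math


def H(p: float) -> float:
    """Shannon's binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def example_g(l: float, n: int) -> float:
    """Piecewise-linear g: slope -1, then a plateau, then slope -1 down to 0."""
    low, high = n * H(1 / 6), n * H(1 / 3)
    if l <= low:
        return n * (1 + H(1 / 6) - H(1 / 3)) - l
    if l <= high:
        return n * (1 - H(1 / 3))
    return n - l


def example_r(delta: float, n: int) -> float:
    """r(delta) = g(n H(delta)) for Hamming distortion, 0 <= delta <= 1/2."""
    return example_g(n * H(delta), n)


if __name__ == "__main__":
    n = 600
    for delta in (0.0, 1 / 12, 1 / 6, 1 / 4, 1 / 3, 5 / 12, 1 / 2):
        print(f"delta={delta:.3f}  r={example_r(delta, n):8.2f}")
```

On the interval from $\frac{1}{6}$ to $\frac{1}{3}$ the printed rate stays at $n(1 - H(\frac{1}{3}))$, which is the flat segment discussed in the example.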

V. Characterization 

Theorem 2 below states that a destination word that codes a given source word and minimizes the algorithmic 
mutual information with the given source word gives no advantage in rate over a minimal Kolmogorov complexity 
destination word that codes the source word. This theorem can be compared with Shannon's theorem, Theorem 5 
in Appendix A, about the expected rate-distortion curve of a random variable. 

Theorem 2: Let $\mathcal{A}$ be a distortion family satisfying properties 2 and 3, and $\mathcal{A}(x) = \{A \in \mathcal{A} : x \in A\}$. 
For every $n$ and string $x$ of length $n$ and every $B \in \mathcal{A}(x)$ there is an $A \in \mathcal{A}(x)$ with $\lceil \log |A| \rceil = \lceil \log |B| \rceil$ 
and $C(A) \le I(x:B) + O(\log C(B) + \log n)$, where $I(x:B) = C(B) - C(B \mid x)$ stands for the algorithmic 
information in $x$ about $B$. 

For further information about $I(x:B)$ see Definition 11 in Appendix C. The proof of Shannon's theorem, 
Theorem 5, and the proof of the current theorem are very different. The latter proof uses techniques that may be 
of independent interest. In particular, we use an online set cover algorithm where the sets come sequentially and 
we always have to have the elements covered that occur in a certain number of sets (Lemma 6 in Appendix F). 

EXAMPLE 6: Theorem 2 states that for an appropriate distortion family $\mathcal{A}$ of nonempty finite subsets of $\{0,1\}^*$ 
and for every string $x \in \{0,1\}^*$, if there exists an $A \in \mathcal{A}$ of cardinality $2^l$ or less containing $x$ that has small 
algorithmic information about $x$, then there exists another set $B \in \mathcal{A}$ containing $x$ that has also at most $2^l$ elements 
and has small Kolmogorov complexity itself. For example, in the case of Hamming distortion, if for a given string 
$x$ there exists a string $y$ at Hamming distance $\delta$ from $x$ that has small information about $x$, then there exists another 
string $z$ that is also within distance $\delta$ of $x$ and has small Kolmogorov complexity itself (not only small algorithmic 
information about $x$). 

VI. Fitness of Destination Word 

In Theorem 3 we show that if a destination word of a certain maximal Kolmogorov complexity has minimal 
distortion with respect to the source word, then it also is the (almost) best-fitting destination word in the sense 
(explained below) that among all destination words of that Kolmogorov complexity it has the most properties in 
common with the source word. 'Fitness' of individual strings to an individual destination word is hard, if not 
impossible, to describe in the probabilistic framework. However, for the combinatoric and computational notion of 
Kolmogorov complexity it is natural to describe this notion using 'randomness deficiency' as in Definition 8 below. 

Reference [22] uses 'fitness' with respect to the particular distortion family $\mathcal{L}$. We briefly overview the
generalization to arbitrary distortion families satisfying properties 2 and 3 (details, formal statements and proofs about $\mathcal{L}$
can be found in the cited reference). The goodness of fit of a destination word $y$ for a source word $x$ with respect
to an arbitrary distortion family $\mathcal{A}$ is defined by the randomness deficiency of $x$ in the distortion ball $B(y, \delta)$
with $\delta = d(x, y)$. The lower the randomness deficiency, the better the fit.

DEFINITION 8: The randomness deficiency of $x$ in a set $A$ with $x \in A$ is defined as $\delta(x \mid A) = \log |A| - C(x \mid A)$.
If $\delta(x \mid A)$ is small then $x$ is a typical element of $A$. Here 'small' is taken as $O(1)$ or $O(\log n)$ where $n = |x|$,
depending on the context of the future statements.

The randomness deficiency can be slightly smaller than 0, but by no more than a constant.
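
The lower bound on the deficiency can be made explicit. The short derivation below is a sketch that relies only on the standard fact, not spelled out at this point of the text, that an element of a finite set can be described by its index in that set:

```latex
% Given A, the element x is determined by its index in a fixed
% enumeration of A, which takes at most \lceil \log |A| \rceil bits.
C(x \mid A) \le \log |A| + O(1)
\quad\Longrightarrow\quad
\delta(x \mid A) = \log |A| - C(x \mid A) \ge -O(1).
```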

Definition 9: Let $\beta$ be an integer parameter and $P \subseteq A$. We say $P$ is a property in $A$ if $P$ is a 'majority'
subset of $A$, that is, $|P| > (1 - 2^{-\beta})|A|$. We say that $x \in A$ satisfies property $P$ if $x \in P$.

If the randomness deficiency $\delta(x \mid A)$ is not much greater than 0, then there are no simple special properties
that single $x$ out from the majority of strings to be drawn from $A$. This is not just terminology: If $\delta(x \mid A)$ is small
enough, then $x$ satisfies all properties of low Kolmogorov complexity in $A$ (Lemma 4 in Appendix D). If $A$ is a set
containing $x$ such that $\delta(x \mid A)$ is small then we say that $A$ is a set of good fit for $x$. In [22] the notion of models
for $x$ is considered: Every finite set of strings containing $x$ is a model for $x$. Let $x$ be a string of length $n$ and
choose an integer $i$ between 0 and $n$. Consider models for $x$ of Kolmogorov complexity at most $i$. Theorem IV.8
and Remark IV.10 in [22] show for the distortion family $\mathcal{L}$ that $x$ has minimal randomness deficiency in every set
that witnesses $h_x(i)$ (for $\mathcal{L}$ we have $h_x(i) = d_x(i)$), ignoring additive $O(\log n)$ terms. That is, up to the stated
precision every such witness set is the best-fitting model that is possible at model Kolmogorov complexity at most
$i$. It is remarkable, and in fact unexpected to the authors, that the analogous result holds for arbitrary distortion
families provided they satisfy properties 2 and 3.

Theorem 3: Let $\mathcal{A}$ be a distortion family satisfying properties 2 and 3 and $x$ a string of length $n$. Let $B$ be a
set in $\mathcal{A}$ with $x \in B$. Let $A_x$ be a set of minimal Kolmogorov complexity among the sets $A \in \mathcal{A}$ with $x \in A$ and
$\lceil \log |A| \rceil = \lceil \log |B| \rceil$. Then,

$$C(A_x) + \log |A_x| - C(x) \le \delta(x \mid B) + O(\log C(B) + \log n).$$

Lemma 3: For every set $A$ with $x \in A$,

$$C(A) + \log |A| - C(x) \ge \delta(x \mid A), \qquad (6)$$

up to an $O(\log n)$ additive term.

Proof: The inequality (6) means that

$$C(A) + \log |A| - C(x) \ge \log |A| - C(x \mid A) + O(\log n),$$

that is, $C(x) \le C(A) + C(x \mid A) + O(\log n)$. The latter inequality follows from the general inequality $C(x) \le
C(x, y) \le C(y) + C(x \mid y) + O(\log C(x \mid y))$, where $C(x \mid y) \le C(x) + O(1) \le n + O(1)$. ■

A set $A$ with $x \in A$ is an algorithmic sufficient statistic for $x$ if $C(A) + \log |A|$ is close to $C(x)$. Lemma 3
shows that every sufficient statistic for $x$ is a model of a good fit for $x$.

Example 7: Consider the elements of every $A \in \mathcal{A}$ uniformly distributed. Assume that we are given a string $x$
that was obtained by random sampling from an unknown set $B \in \mathcal{A}$ satisfying $C(B) \le n = |x|$. Given $x$ we want
to recover $B$, or some $A \in \mathcal{A}$ that is "a good hypothesis to be the source of $x$" in the sense that the randomness
deficiency $\delta(x \mid A)$ is small. Consider the set $A_x$ from Theorem 3 as such a hypothesis. We claim that with high
probability $\delta(x \mid A_x)$ is of order $O(\log n)$. More specifically, for every $\beta$ the probability of the event $\delta(x \mid A_x) > \beta$
is less than $2^{-\beta + O(\log n)}$, which is negligible for $\beta = O(\log n)$. Indeed, if $x$ is chosen uniformly at random in $B$,
then with high probability (Appendix D) the randomness deficiency $\delta(x \mid B)$ is small. That is, with probability
more than $1 - 2^{-\beta}$ we have $\delta(x \mid B) \le \beta$. By Theorem 3 and (6) we also have $\delta(x \mid A_x) \le \delta(x \mid B) + O(\log n)$.
Therefore the probability of the event $\delta(x \mid A_x) > \beta$ is less than $2^{-\beta + O(\log n)}$. ◊

EXAMPLE 8: Theorem 3 says that for fixed log-cardinality $l$ the model that has minimal Kolmogorov complexity
also has minimal randomness deficiency among models of that log-cardinality. Since $g_x$ satisfies Lemma 1, we
also have that for every $k$ the model of Kolmogorov complexity at most $k$ that minimizes the log-cardinality also
minimizes randomness deficiency among models of that Kolmogorov complexity. These models can be computed
in the limit, in the first case by running all programs up to $k$ bits and always keeping the one that outputs the
smallest set in $\mathcal{A}$ containing $x$, and in the second case by running all programs up to $n = |x|$ bits and always
keeping the shortest one that outputs a set in $\mathcal{A}$ containing $x$ having log-cardinality at most $l$. ◊

VII. Denoising

In Theorem 3 using (6) we obtain

$$\delta(x \mid A_x) \le \delta(x \mid B) + O(\log C(B) + \log n). \qquad (7)$$

This gives a method to identify good-fitting models for $x$ using compression, as follows. Let $k = C(A_x)$ and
$l = \lceil \log |B| \rceil$. If $A_x$ is a set of minimal Kolmogorov complexity among sets $A \in \mathcal{A}$ with $x \in A$ and $\lceil \log |A| \rceil = l$,
then by (7) the hypothesis "$x$ is chosen at random in $A_x$" is (almost) at least as plausible as the hypothesis "$x$ is
chosen at random in $B$" for every simply described $B \in \mathcal{A}$ (say, $\log C(B) = O(\log n)$) with $\lceil \log |B| \rceil = l$.
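
For the reader's convenience, the step from Theorem 3 and Lemma 3 to inequality (7) can be written out as a two-line chain. This is only a sketch of the intended reasoning, with all $O(\cdot)$ terms collected as in the statements of the two results:

```latex
\begin{align*}
\delta(x \mid A_x)
  &\le C(A_x) + \log |A_x| - C(x) + O(\log n)
     && \text{(Lemma 3 applied to } A = A_x\text{)} \\
  &\le \delta(x \mid B) + O(\log C(B) + \log n)
     && \text{(Theorem 3)}.
\end{align*}
```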

Let us look at an example of denoising by compression (in the ideal sense of Kolmogorov complexity) for
Hamming distortion. Fix a target string $y$ of length $n$ and a distortion $0 \le \delta \le \frac{1}{2}$. (This string $y$ functions as
the destination word.) Let a string $x$ be a noisy version of $y$ obtained by changing at most $n\delta$ randomly chosen bits in $y$
(string $x$ functions as the source word). That is, the string $x$ is chosen uniformly at random in the Hamming ball
$B = B(y, \delta)$. Let $\hat{x}$ be a string witnessing $r_x(\delta)$, that is, $\hat{x}$ is a string of minimal Kolmogorov complexity with
$d(x, \hat{x}) \le \delta$ and $r_x(\delta) = C(\hat{x})$. We claim that at distortion $\delta$ the string $\hat{x}$ is a good candidate for a denoised version
of $x$, that is, the target string $y$. This means that in the two-part description $(\hat{x}, \hat{x} \oplus x)$ of $x$, the second part (the
bitwise XOR of $x$ and $\hat{x}$) is noise: $\hat{x} \oplus x$ is a random string in the Hamming ball $B(00\ldots0, \delta)$ in the sense that
$\delta(\hat{x} \oplus x \mid B(00\ldots0, \delta))$ is negligible. Moreover, even the conditional Kolmogorov complexity $C(\hat{x} \oplus x \mid \hat{x})$ is
close to $\log b(\delta)$.

Indeed, let $l = \lceil \log |B| \rceil$. By Definition 5 of $g_x$, Theorem 3 implies that

$$g_x(l) + l - C(x) \le \delta(x \mid B),$$

ignoring additive terms of $O(\log n)$ and observing that the additive term $\log C(B)$ is absorbed by $O(\log n)$. For
every $x$, the rate-distortion function $r_x$ of $x$ differs from $g_x$ just by changing the scale of the argument as in (4).
More specifically, we have $r_x(\delta) = g_x(l)$ and hence

$$r_x(\delta) + l - C(x) \le \delta(x \mid B).$$

Since we assume that $x$ is chosen uniformly at random in $B$, the randomness deficiency $\delta(x \mid B)$ is small, say
$O(\log n)$ with high probability. Since $r_x(\delta) = C(\hat{x}) = C(B(\hat{x}, \delta)) + O(C(\delta))$, $C(\delta) = O(\log n)$, and $l = \lceil \log b(\delta) \rceil$,
it follows that with high probability, with the equalities holding up to an additive $O(\log n)$ term,

$$0 = C(\hat{x}) + l - C(x) = C(B(\hat{x}, \delta)) + \log b(\delta) - C(x).$$

Since by construction $x \in B(\hat{x}, \delta)$, the displayed equation shows that the ball $B(\hat{x}, \delta)$ is a sufficient statistic for $x$.
This implies that $x$ is a typical element of $B(\hat{x}, \delta)$, that is, $C(\hat{x} \oplus x \mid \hat{x}) = C(x \mid \hat{x}) = C(x \mid B(\hat{x}, \delta), p)$ is close
to $\log b(\delta)$. Here $p$ is an appropriate program of $O(C(\delta)) = O(\log n)$ bits.

This provides a method of denoising via compression, at least in theory. In order to use the method practically,
admittedly with a leap of faith, we ignore the ubiquitous $O(\log n)$ additive terms, and use real compressors to
approximate the Kolmogorov complexity, similar to what was done in [10], [11]. The Kolmogorov complexity
is not computable and can be approximated by a computable process from above but not from below, while a
real compressor is computable. Therefore, the approximation of the Kolmogorov complexity by a real compressor
involves, for some arguments, errors that can be high and are in principle unknowable. Despite all these caveats it
turns out that the practical analogue of the theoretical method works surprisingly well in all experiments we tried
[15].

Fig. 2. Denoising of the noisy cross

As an example, we approximated the distortion-rate function of a noiseless cross of very low Kolmogorov 
complexity, to which artificial noise was added to obtain a noisy cross, [15]. Figure 2 shows two graphs. The first 
graph, hitting the horizontal axis at about 3100 bits, denotes the Hamming distortion on the vertical axis of the best 
model for the noisy cross with respect to the original noisy cross at the rate given on the horizontal axis. The line 
hits zero distortion at model cost bit rate about 3100, when the original noisy cross is retrieved. The best model 
of the noisy cross at this rate, actually the original noisy cross, is attached to this point. The second graph, hitting 
the horizontal axis at about 250 bits, denotes on the vertical axis the Hamming distortion of the best model for the 
noisy cross with respect to the noiseless cross at the rate given on the horizontal axis. The line hits almost zero 
distortion (Hamming distance 3) at model cost bit rate about 250. The best model of the noisy cross at this rate is 
attached to this point. (The three wrong bits are at the bottom left corner and upper right armpit.) This coincides 
with a sharp slowing of the rate of decrease of the first graph. Subsequently, the second graph rises again because 
the best model for the noisy cross starts to model more noise. Thus, the second graph shows us the denoising of 
the noisy cross, underfitting left of the point of contact with the horizontal axis, and overfitting right of that point. 
This point of best denoising can also be deduced from the first graph, where it is the point where the distortion-rate 
curve sharply levels off. Since this point has distortion of only 3 to the noiseless cross, the distortion-rate function 
separates structure and noise very well in this example. 
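
To make the idea of denoising by compression concrete, here is a minimal Python sketch. It is not the method or the compressor used in [15]: it assumes `zlib` as a stand-in for Kolmogorov complexity and replaces the (infeasible) exhaustive search over the Hamming ball by a greedy bit-flipping heuristic. The names `cost`, `hamming`, and `denoise`, and the toy target pattern, are hypothetical and chosen only for illustration.

```python
import random
import zlib


def cost(bits):
    """Computable stand-in for C(bits): size in bytes of the zlib-compressed string."""
    return len(zlib.compress(bytes(bits), 9))


def hamming(u, v):
    """Number of positions in which u and v differ (unnormalized Hamming distance)."""
    return sum(a != b for a, b in zip(u, v))


def denoise(x, max_flips):
    """Greedy heuristic: look for a cheaper string within max_flips bit flips of x."""
    best = list(x)
    best_cost = cost(best)
    for _ in range(max_flips):
        improved = None
        for i in range(len(best)):
            if best[i] != x[i]:
                continue              # position already flipped; keep the distance bounded
            best[i] ^= 1              # tentatively flip bit i
            c = cost(best)
            if c < best_cost:
                best_cost, improved = c, i
            best[i] ^= 1              # undo the tentative flip
        if improved is None:
            break                     # no single flip lowers the compressed size
        best[improved] ^= 1           # commit the best flip of this round
    return best


rng = random.Random(0)
y = [0, 1] * 64                       # assumed simple target pattern
x = list(y)
for i in rng.sample(range(len(y)), 4):
    x[i] ^= 1                         # add noise: flip 4 random bits
x_hat = denoise(x, max_flips=4)
print(cost(x), cost(x_hat), hamming(x, x_hat), hamming(y, x_hat))
```

By construction the result stays within the flip budget and never compresses worse than the input; whether it recovers the target depends on the compressor and on the data, which mirrors the caveats stated above.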

In the experiments in [15] a specially written block sorting compression algorithm with a move-to-front scheme
as described in [3] was used. The algorithm is very similar to a number of common general purpose compressors,
such as bzip2 and zzip, but it is simpler and faster for small inputs; the source code (in C) is available from the
authors of [15].

VIII. Algorithmic versus Probabilistic Rate-Distortion 

Theorem 4 shows that Shannon's rate-distortion function $r_n(\delta)$ of (8) for a random variable is pointwise related
to the expected value of the rate-distortion functions $r_x(\delta)$ of the individual strings $x \in X^n$ (outcomes of the random
variable, with the expectation taken over the probabilities of the random variable). This result generalizes [25], [13],
[20] to arbitrary computable sources.

Formally, probabilistic rate-distortion theory is treated in Appendix A. Let $X$ and $Y$ be finite alphabets where
we take $X = \{0, 1\}$ for convenience. We generalize the setting from i.i.d. random variables to more general
random variables. Let $X_1, X_2, \ldots, X_n$ be a sequence of, possibly dependent, random variables with values in $X^n$
such that $p(x_1 x_2 \ldots x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$ is rational. With $X = X_1, X_2, \ldots, X_n$ and
$x = x_1 x_2 \ldots x_n$, let $C(X)$ denote the Kolmogorov complexity of the set of pairs $(x, p(x))$ ordered lexicographically.
Let $E : X^n \to Y^n$ be a code. Define the Shannon rate-distortion function by

$$r_n(\delta) = \min_E \{ \log |E(X^n)| : \mathbf{E}\, d(x, E(x)) \le \delta \}, \qquad (8)$$

the expectation $\mathbf{E}$ taken over the probability mass function $p$.

Theorem 4: Let $E_0$ be a many-to-one coding function defined by $E_0(x) = y$ with $d(x, y) \le \delta$ and $r_x(\delta) =
C(y)$. Let $|x| = n$. Then,

$$\mathbf{E}\, r_x(\delta) - \Delta_1 \le r_n(\delta) \le \min \left\{ \mathbf{E}\, r_x(\delta) + \Delta_2, \; \max_{x \in X^n} r_x(\delta) \right\},$$

with $\Delta_1 = O(C(\delta, r_n, X, n))$, $\Delta_2 = H(L) - H(S)$ with $S(y) = \sum \{ p(x) : E_0(x) = y \}$, $L(y)$ the uniform
distribution over the $y$'s in $Y^n$, and the expectation $\mathbf{E}$ taken over $p$.

Note that we have taken $\mathcal{X} = \{0, 1\}^n = X^n$ and $\mathcal{Y} = Y^n$. The quantity $\Delta_1$ satisfies $\lim_{n \to \infty} \Delta_1 / n = 0$.
The quantity $\Delta_2$ is small only in the case where we have asymptotic equidistribution. This is the original
setting of Shannon. Independence is not needed, however: for example, ergodic stationarity guarantees asymptotic
equidistribution.

Appendix 

A. Shannon Rate Distortion 

Classical rate-distortion theory was initiated by Shannon in [17], [18], and we briefly recall his approach. Let $X$
and $Y$ be finite alphabets. A single-letter distortion measure is a function $d$ that maps elements of $X \times Y$ to the
reals. Define the distortion between words $x$ and $y$ of the same length $n$ over alphabets $X$ and $Y$, respectively, by

$$d^n(x, y) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, y_i).$$

Let $X$ be a random variable with values in the alphabet $X$. Consider the random variable $X^n$ with values in $X^n$, that is, the
sequence $X_1, \ldots, X_n$ of $n$ independent copies of $X$. We want to encode words of length $n$ over $X$ by words over
$Y$ so that the number of all code words is small and the expected distortion between outcomes of $X^n$ and their
codes is small. The tradeoff between the expected distortion and the number of code words used is expressed by
the rate-distortion function denoted by $r_n(\delta)$ as in (8). It maps every $\delta \in \mathbb{R}$ to the minimal natural number $r$
(we call $r$ the rate) having the following property: There is an encoding function $E : X^n \to Y^n$ with a range of
cardinality at most $2^r$ such that the expected distortion between the outcomes of $X^n$ and their corresponding codes
is at most $\delta$.
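
The definitions above can be tried out on a toy case. The following Python sketch assumes the binary alphabet, the Hamming single-letter distortion, a uniform source on words of length 3, and a hypothetical majority-vote code; none of these choices come from the text, they only illustrate how the rate $\log |E(X^n)|$ and the expected distortion of a fixed code are computed.

```python
from itertools import product
from math import log2


def hamming_letter(a, b):
    """Single-letter distortion d(a, b): 0 if the letters agree, 1 otherwise."""
    return 0 if a == b else 1


def distortion(x, y, d=hamming_letter):
    """Per-letter average (1/n) * sum_i d(x_i, y_i) for words of equal length n."""
    assert len(x) == len(y)
    return sum(d(a, b) for a, b in zip(x, y)) / len(x)


def majority_code(x):
    """Toy encoding E: map x to the all-0 or all-1 word by majority vote."""
    bit = "1" if 2 * x.count("1") > len(x) else "0"
    return bit * len(x)


def rate_and_expected_distortion(n, code, p):
    """Return (log2 of the number of code words used, expected distortion under pmf p)."""
    words = ["".join(w) for w in product("01", repeat=n)]
    rate = log2(len({code(x) for x in words}))
    expected = sum(p(x) * distortion(x, code(x)) for x in words)
    return rate, expected


rate, expected = rate_and_expected_distortion(3, majority_code, lambda x: 1 / 8)
print(rate, expected)
```

Here the code uses two code words, so the rate is 1 bit, and six of the eight source words are at per-letter distortion 1/3 from their code word, giving an expected distortion of approximately 0.25.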

In [18] Shannon gave the following nonconstructive asymptotic characterization of $r_n(\delta)$. Let $Z$ be a random
variable with values in $Y$. Let $H(Z)$, $H(Z \mid X)$ stand for the Shannon entropy and conditional Shannon entropy,
respectively. Let $I(X; Z) = H(Z) - H(Z \mid X)$ denote the mutual information in $X$ and $Z$, and $\mathbf{E}\, d(X, Z)$ stand
for the expected value of $d(x, z)$ with respect to the joint probability $P(X = x, Z = z)$ of the random variables
$X$ and $Z$. For a real $\delta$, let $R(\delta)$ denote the minimal $I(X; Z)$ subject to $\mathbf{E}\, d(X, Z) \le \delta$. That such a minimum is
attained for all $\delta$ can be shown by compactness arguments.

THEOREM 5: For every $n$ and $\delta$ we have $r_n(\delta) \ge nR(\delta)$. Conversely, for every $\delta$ and every positive $\epsilon$, we have
$r_n(\delta + \epsilon) \le n(R(\delta) + \epsilon)$ for all large enough $n$.

B. Computability 

In 1936 A.M. Turing [21] defined the hypothetical 'Turing machine' whose computations are intended to give
an operational and formal definition of the intuitive notion of computability in the discrete domain. These Turing
machines compute integer functions, the computable functions. By using pairs of integers for the arguments and
values we can extend computable functions to functions with rational arguments and/or values. The notion of
computability can be further extended, see for example [9]: A function $f$ with rational arguments and real values
is upper semicomputable if there is a computable function $\phi(x, k)$ with $x$ a rational number and $k$ a nonnegative
integer such that $\phi(x, k + 1) \le \phi(x, k)$ for every $k$ and $\lim_{k \to \infty} \phi(x, k) = f(x)$. This means that $f$ can be
computably approximated from above. A function $f$ is lower semicomputable if $-f$ is upper semicomputable.
A function is called semicomputable if it is either upper semicomputable or lower semicomputable or both. If a
function $f$ is both upper semicomputable and lower semicomputable, then $f$ is computable. A countable set $S$ is
computably (or recursively) enumerable if there is a Turing machine $T$ that outputs all and only the elements of $S$
in some order and does not halt. A countable set $S$ is decidable (or recursive) if there is a Turing machine $T$ that
decides for every candidate $a$ whether $a \in S$ and halts.

Example 9: An example of a computable function is $f(n)$ defined as the $n$th prime number; an example of a
function that is upper semicomputable but not computable is the Kolmogorov complexity function $C$ in Appendix C.
An example of a recursive set is the set of prime numbers; an example of a recursively enumerable set that is not
recursive is $\{x \in \mathbb{N} : C(x) < |x|\}$.

Let $\mathcal{X} = \{0, 1\}^*$, and $\mathcal{Y}$ and the distortion measure $d$ be given. Assume that $\mathcal{Y}$ is recursively (= computably)
enumerable and the set $\{(x, y, \delta) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{Q} : d(x, y) \le \delta\}$ is decidable. Then $r_x$ is upper semicomputable.
Namely, to determine $r_x(\delta)$ proceed as follows. Recall that $U$ is the reference universal Turing machine. Run $U(p)$
for all $p$ in dovetailed fashion (in stage $k$ of the overall computation execute the $i$th computation step of the $(k - i)$th
program). Interleave this computation with a process that recursively enumerates $\mathcal{Y}$. Put all enumerated elements
of $\mathcal{Y}$ in a set $W$. Whenever $U(p)$ halts we put the output in a set $\mathcal{U}$. After every step in the overall computation we
determine the minimum length of a program $p$ such that $U(p) \in W \cap \mathcal{U}$ and $d(x, U(p)) \le \delta$. We call $p$ a candidate
program. The minimal length of all candidate programs can only decrease in time and eventually becomes equal
to $r_x(\delta)$. Thus, this process upper semicomputes $r_x(\delta)$.
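
The dovetailing procedure can be illustrated with a toy model. In the Python sketch below, programs for the reference machine $U$ are replaced by hypothetical Python generators with an assigned "length", the enumeration of the destination set is omitted (every output is assumed to be a valid destination word), and the distance is an unnormalized Hamming distance. The names `make_program`, `loop_forever`, and `upper_semicompute` are illustrative assumptions, not part of the original construction.

```python
def make_program(steps, output):
    """Toy stand-in for a program p: runs `steps` steps, then halts with `output`."""
    def run():
        for _ in range(steps):
            yield              # one computation step
        return output          # halting; the value travels in StopIteration
    return run


def loop_forever():
    """A toy program that never halts."""
    while True:
        yield


def upper_semicompute(programs, is_candidate, stages):
    """Return the current upper bound after each stage of the dovetailed run.

    `programs` is a list of (length, factory) pairs. Program j is started in
    stage j, so in stage k it executes its step number k - j.
    """
    running = {}
    best = float("inf")
    bounds = []
    for k in range(stages):
        if k < len(programs):
            running[k] = programs[k][1]()      # start the k-th program
        for j in list(running):
            try:
                next(running[j])               # advance program j by one step
            except StopIteration as halt:
                del running[j]
                if is_candidate(halt.value):   # output within distortion of x
                    best = min(best, programs[j][0])
        bounds.append(best)
    return bounds


x, delta = "0000", 1


def within_delta(y):
    """Candidate test: unnormalized Hamming distance to x is at most delta."""
    return sum(a != b for a, b in zip(x, y)) <= delta


programs = [
    (5, make_program(1, "0001")),   # long program, halts quickly
    (3, make_program(6, "0000")),   # shorter program, halts late
    (2, make_program(2, "1111")),   # halts, but output is too far from x
    (1, loop_forever),              # shortest program, never halts
]
bounds = upper_semicompute(programs, within_delta, stages=10)
print(bounds)
```

The sequence of bounds never increases, and at no finite stage can one be sure that the current bound is final, since a shorter program may still halt later. This is exactly why the procedure approximates from above without yielding a computable function.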

The function $g_x$ is also upper semicomputable. The proof is similar to that used to prove the upper
semicomputability of $r_x$. It follows from [22] that in general $d_x$, and hence its 'inverse' $r_x$ and by Lemma 2 the function
$g_x$, are not computable.

Assume that the set $\mathcal{Y}$ is recursively enumerable and the set $\{(x, y, \delta) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{Q} : d(x, y) \le \delta\}$ is decidable.
Assume that the resulting distortion family $\mathcal{A}^{d, \mathcal{Y}}$ satisfies Property 2. There is a relation between destination words
and distortion balls. This relation is as follows.

(i) Communicating a destination word $y$ for a source word $x$ knowing a rational upper bound $\delta$ for the distortion
$d(x, y)$ involved is the same as communicating a distortion ball of radius $\delta$ containing $x$.

(ii) Given (a list of the elements of) a distortion ball $B$ we can upper semicompute the least distortion $\delta$ such
that $B = B(y, \delta)$ for some $y \in \mathcal{Y}$.

Ad (i). This implies that the function $f_x(\delta)$ defined in (5) differs from $r_x(\delta)$ by $O(C(\delta) + \log |x|)$. See the proof
of Lemma 2.

Ad (ii). Let $B$ be a given ball. Recursively enumerating $\mathcal{Y}$ and the possible $\beta \in \mathbb{Q}$, we check for every newly
enumerated element $y \in \mathcal{Y}$ whether $B(y, \beta) = B$ (see the proof of Lemma 2 for an algorithm to find a list of
elements of $B(y, \beta)$ given $y, \beta$). Put these $\beta$'s in a set $W$. Consider the least element of $W$ at every computation
step. This process upper semicomputes the least distortion $\delta$ corresponding to the distortion ball $B$.

C. Kolmogorov Complexity 

For precise definitions, notation, and results see the text [9]. Informally, the Kolmogorov complexity, or algorithmic
entropy, $C(x)$ of a string $x$ is the length (number of bits) of a shortest binary program (string) to compute $x$ on
a fixed reference universal computer (such as a particular universal Turing machine). Intuitively, $C(x)$ represents
the minimal amount of information required to generate $x$ by any effective process. The conditional Kolmogorov
complexity $C(x \mid y)$ of $x$ relative to $y$ is defined similarly as the length of a shortest binary program to compute
$x$, if $y$ is furnished as an auxiliary input to the computation.

Let $T_1, T_2, \ldots$ be a standard enumeration of all (and only) Turing machines with a binary input tape, for example
the lexicographic length-increasing ordered syntactic Turing machine descriptions [9], and let $\phi_1, \phi_2, \ldots$ be the
enumeration of corresponding functions that are computed by the respective Turing machines ($T_i$ computes $\phi_i$).
These functions are the computable (or recursive) functions. For the development of the theory we actually require
the Turing machines to use auxiliary (also called conditional) information, by equipping the machines with a
special read-only auxiliary tape containing this information at the outset. Let $\langle \cdot, \cdot \rangle$ be a computable one-to-one
pairing function on the natural numbers (equivalently, strings) mapping $\{0, 1\}^* \times \{0, 1\}^* \to \{0, 1\}^*$ with $|\langle u, v \rangle| \le
|u| + |v| + O(\log |u|)$. (We need the extra $O(\log |u|)$ bits to separate $u$ from $v$. For Kolmogorov complexity, it
is essential that there exists a pairing function such that the length of $\langle u, v \rangle$ is equal to the sum of the lengths of
$u, v$ plus a small value depending only on $|u|$.) We denote the function computed by a Turing machine $T_i$ with $p$
as input and $y$ as conditional information by $\phi_i(p, y)$.

One of the main achievements of the theory of computation is that the enumeration $T_1, T_2, \ldots$ contains a machine,
say $T_u$, that is computationally universal in that it can simulate the computation of every machine in the enumeration
when provided with its index. It does so by computing a function $\phi_u$ such that $\phi_u(\langle i, p \rangle, y) = \phi_i(p, y)$ for all $i, p, y$.
We fix one such machine and designate it as the reference universal Turing machine or reference Turing machine
for short.

DEFINITION 10: The conditional Kolmogorov complexity of $x$ given $y$ (as auxiliary information) with respect to
Turing machine $T_i$ is

$$C_i(x \mid y) = \min_p \{ |p| : \phi_i(p, y) = x \}. \qquad (9)$$

The conditional Kolmogorov complexity $C(x \mid y)$ is defined as the conditional Kolmogorov complexity $C_u(x \mid y)$
with respect to the reference Turing machine $T_u$, usually denoted by $U$. The unconditional version is set to $C(x) =
C(x \mid \epsilon)$.

Kolmogorov complexity $C(x \mid y)$ has the following crucial property: $C(x \mid y) \le C_i(x \mid y) + c_i$ for all $i, x, y$, where
$c_i$ depends only on $i$ (asymptotically, the reference Turing machine is not worse than any other machine). Intuitively,
$C(x \mid y)$ represents the minimal amount of information required to generate $x$ by any effective process from input
$y$. The functions $C(\cdot)$ and $C(\cdot \mid \cdot)$, though defined in terms of a particular machine model, are machine-independent
up to an additive constant and acquire an asymptotically universal and absolute character through Church's thesis,
see for example [9], and from the ability of universal machines to simulate one another and execute any effective
process. The Kolmogorov complexity of an individual finite object was introduced by Kolmogorov [7] as an absolute
and objective quantification of the amount of information in it. The information theory of Shannon [17], on the
other hand, deals with average information to communicate objects produced by a random source. Since the former
theory is much more precise, it is surprising that analogs of theorems in information theory hold for Kolmogorov
complexity, be it in somewhat weaker form. For example, let $X$ and $Y$ be random variables with a joint distribution.
Then, $H(X, Y) \le H(X) + H(Y)$, where $H(X)$ is the entropy of the marginal distribution of $X$. Similarly, let
$C(x, y)$ denote $C(\langle x, y \rangle)$ where $\langle \cdot, \cdot \rangle$ is a standard pairing function as defined previously and $x, y$ are strings. Then
we have $C(x, y) \le C(x) + C(y) + O(\log C(x))$. Indeed, there is a Turing machine $T_i$ that provided with $\langle p, q \rangle$
as an input computes $\langle U(p), U(q) \rangle$ (where $U$ is the reference Turing machine). By construction of $T_i$, we have
$C_i(x, y) \le C(x) + C(y) + O(\log C(x))$, hence $C(x, y) \le C(x) + C(y) + O(\log C(x))$.

Another interesting similarity is the following: $I(X; Y) = H(Y) - H(Y \mid X)$ is the (probabilistic) information
in random variable $X$ about random variable $Y$. Here $H(Y \mid X)$ is the conditional entropy of $Y$ given $X$. Since
$I(X; Y) = I(Y; X)$ we call this symmetric quantity the mutual (probabilistic) information.
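
The two probabilistic facts just mentioned, subadditivity of entropy and symmetry of mutual information, are easy to check numerically. The Python sketch below does so for one small joint distribution; the distribution and the helper names (`entropy`, `marginal`, `conditional_entropy`, `information`) are assumptions made for this illustration only.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal pmf of coordinate `axis` (0 for X, 1 for Y) of a joint pmf."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


def conditional_entropy(joint, given_axis):
    """H(other | given), computed as sum_g P(g) * H(other | given = g)."""
    total = 0.0
    for g, pg in marginal(joint, given_axis).items():
        if pg == 0:
            continue
        cond = {pair: p / pg for pair, p in joint.items() if pair[given_axis] == g}
        total += pg * entropy(cond)
    return total


def information(joint, about_axis):
    """Information about coordinate `about_axis` carried by the other coordinate."""
    return entropy(marginal(joint, about_axis)) - conditional_entropy(joint, 1 - about_axis)


joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
i_x_about_y = information(joint, 1)   # I(X;Y) = H(Y) - H(Y|X)
i_y_about_x = information(joint, 0)   # I(Y;X) = H(X) - H(X|Y)
print(i_x_about_y, i_y_about_x)
```

The two printed values agree up to floating-point error, and the joint entropy does not exceed the sum of the marginal entropies.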

DEFINITION 11: The (algorithmic) information in $x$ about $y$ is $I(x : y) = C(y) - C(y \mid x)$, where $x, y$ are finite
objects like finite strings or finite sets of finite strings.

It is remarkable that also the algorithmic information in one finite object about another one is symmetric:
$I(x : y) = I(y : x)$ up to an additive term logarithmic in $C(x) + C(y)$. This follows immediately from the
symmetry of information property due to A.N. Kolmogorov and L.A. Levin:

$$\begin{aligned} C(x, y) &= C(x) + C(y \mid x) + O(\log(C(x) + C(y))) \qquad (10) \\ &= C(y) + C(x \mid y) + O(\log(C(x) + C(y))). \end{aligned}$$
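
The step from the symmetry of information to the symmetry of $I$ is a one-line rearrangement, sketched here for completeness:

```latex
\begin{align*}
C(x) + C(y \mid x) &= C(y) + C(x \mid y) + O(\log(C(x) + C(y))) \\
\Longrightarrow\quad
\underbrace{C(y) - C(y \mid x)}_{I(x : y)}
  &= \underbrace{C(x) - C(x \mid y)}_{I(y : x)} + O(\log(C(x) + C(y))).
\end{align*}
```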

D. Randomness Deficiency and Fitness 

Randomness deficiency of an element $x$ of a finite set $A$ according to Definition 8 is related to the fitness of
$x \in A$ (identified with the fitness of set $A$ as a model for $x$) in the sense of $x$ having most properties represented
by the set $A$. Properties are identified with large subsets of $A$ whose Kolmogorov complexity is small (the 'simple'
subsets).

Lemma 4: Let $\beta, \gamma$ be constants. Assume that $P$ is a subset of $A$ with $|P| > (1 - 2^{-\beta})|A|$ and $C(P \mid A) \le \gamma$.
Then the randomness deficiency $\delta(x \mid A)$ of every $x \in A \setminus P$ satisfies $\delta(x \mid A) \ge \beta - \gamma - O(\log \log |A|)$.

Proof: Since $\delta(x \mid A) = \log |A| - C(x \mid A)$ and $C(x \mid A) \le C(x \mid A, P) + C(P \mid A) + O(\log C(x \mid A, P))$,
while $C(x \mid A, P) \le -\beta + \log |A| + O(1) \le \log |A| + O(1)$, we obtain $\delta(x \mid A) \ge \beta - \gamma - O(\log \log |A|)$. ■

The randomness deficiency measures our disbelief that $x$ can be obtained by random sampling in $A$ (where all
elements of $A$ are equiprobable). For every $A$, the randomness deficiency of almost all elements of $A$ is small: The
number of $x \in A$ with $\delta(x \mid A) > \beta$ is fewer than $|A| 2^{-\beta}$. This can be seen as follows. The inequality $\delta(x \mid A) > \beta$
implies $C(x \mid A) < \log |A| - \beta$. Since $1 + 2 + 2^2 + \cdots + 2^{i-1} = 2^i - 1$, there are less than $2^{\log |A| - \beta}$ programs
of fewer than $\log |A| - \beta$ bits. Therefore, the number of $x$'s satisfying the inequality $C(x \mid A) < \log |A| - \beta$
cannot be larger. Thus, with high probability the randomness deficiency of an element randomly chosen in $A$ is
small. On the other hand, if $\delta(x \mid A)$ is small, then there is no way to refute the hypothesis that $x$ was obtained by
random sampling from $A$: Every such refutation is based on a simply described property possessed by a majority of
elements of $A$ but not by $x$. Here it is important that we consider only simply described properties, since otherwise
we can refute the hypothesis by exhibiting the property $P = A \setminus \{x\}$.
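
The counting argument in the preceding paragraph can be checked directly for small parameters. The Python sketch below assumes, for simplicity, that $\log |A|$ and $\beta$ are integers; the function names are hypothetical.

```python
from itertools import product


def programs_shorter_than(k):
    """All binary strings of length 0, 1, ..., k - 1; there are 2**k - 1 of them."""
    return ["".join(bits) for m in range(k) for bits in product("01", repeat=m)]


def max_atypical(log_card, beta):
    """Upper bound on the number of x in A with deficiency above beta, for |A| = 2**log_card.

    Each such x needs its own program of fewer than log_card - beta bits,
    so there cannot be more such x than there are such programs.
    """
    return len(programs_shorter_than(log_card - beta))


print(len(programs_shorter_than(4)))          # 1 + 2 + 4 + 8
print(max_atypical(10, 3), 2 ** 10 * 2 ** -3)
```

For $|A| = 2^{10}$ and $\beta = 3$ the bound is $2^7 - 1 = 127$ elements, which is indeed fewer than $|A| 2^{-\beta} = 128$.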

E. Covering Coefficient for Hamming Distortion 

The authors find it difficult to believe that the covering result in the lemma below is new. But neither a literature 
search nor the consulting of experts has turned up an appropriate reference. 

Lemma 5: Consider the distortion family $\mathcal{H}_n$. For all $0 \le d \le \delta \le \frac{1}{2}$ every Hamming ball of radius $\delta$ in $\mathcal{H}_n$
can be covered by at most $\alpha_n b(\delta)/b(d)$ Hamming balls of radius $d$ in $\mathcal{H}_n$, where $\alpha_n$ is a polynomial in $n$.

Proof: Fix a ball with center $y$ and radius $\delta = j/n \le \frac{1}{2}$ where $j$ is a natural number. All the strings in the
ball that are at Hamming distance at most $d$ from $y$ can be covered by one ball of radius $d$ with center $y$. Thus it
suffices, for every $\Delta$ of the form $i/n$ with $i = 2, 3, \ldots, j$ (such that $d < \Delta \le \delta$), to cover the set of all the strings
at distance precisely $\Delta$ from $y$ by $n^{c+1} b(\delta)/b(d)$ balls of radius $d$ for some fixed constant $c$. Then the ball $B(y, \delta)$
is covered by at most $j n^{c+1} b(\delta)/b(d) \le n^{c+2} b(\delta)/b(d)$ balls of radius $d$.

Fix $\Delta$ and let the Hamming sphere $S$ denote the set of all strings at distance precisely $\Delta$ from $y$. Let $f$ be the
solution to the equation $d + f(1 - 2d) = \Delta$ rounded to the closest rational of the form $i/n$. Since $d < \Delta \le \delta \le \frac{1}{2}$
this equation has a unique solution and it lies in the closed real interval $[0, 1]$. Consider a ball $B$ of radius $d$
with a random center $z$ at distance $f$ from $y$. Assume that all centers at distance $f$ from $y$ are chosen with equal
probabilities $1/s(f)$ where $s(f)$ is the number of points in a Hamming sphere of radius $f$.

Claim 1: Let $x$ be a particular string in $S$. Then

$$\Pr(x \in B) \ge \frac{b(d)}{n^c\, b(\delta)}$$

for some fixed positive constant $c$.

Proof: Fix a string $z$ at distance $f$ from $y$. We first claim that the ball $B$ of radius $d$ with center $z$ covers
at least $b(d)/n^c$ strings in $S$. Without loss of generality, assume that the string $y$ consists of only zeros and string $z$ consists
of $fn$ ones and $(1 - f)n$ zeros. Flip a set of $fdn$ ones and a set of $(1 - f)dn$ zeros in $z$ to obtain a string $u$.
The total number of flipped bits is equal to $dn$ and therefore $u$ is at distance $d$ from $z$. The number of ones in $u$
is $fn - fdn + (1 - f)dn = \Delta n$ and therefore $u \in S$. Different choices of the positions of the same numbers of
flipped bits result in different strings in $S$. The number of ways to choose the flipped bits is equal to

$$\binom{fn}{fdn} \binom{(1 - f)n}{(1 - f)dn}.$$

By Stirling's formula, this is at least

$$2^{fnh(d) + (1 - f)nh(d) - O(\log n)} = 2^{nh(d) - O(\log n)} \ge \frac{b(d)}{n^c},$$

where the last inequality follows from (3). Therefore a ball $B$ as above covers at least $b(d)/n^c$ strings of $S$. The
probability that a ball $B$, chosen uniformly at random as above, covers a particular string $x \in S$ is the same for
every such $x$ since they are in symmetric position. The number of elements in a Hamming sphere is smaller than
the cardinality of a Hamming ball of the same radius, $|S| \le b(\delta)$. Hence with probability

$$\frac{b(d)}{n^c\, |S|} \ge \frac{b(d)}{n^c\, b(\delta)}$$

a random ball $B$ covers a particular string $x$ in $S$. ■
By Claim 1, the probability that a random ball $B$ does not cover a particular string $x \in S$ is at most $1 -
b(d)/(n^c b(\delta))$. The probability that no ball out of $N$ randomly drawn such balls $B$ covers a particular $x \in S$ (all
balls are equiprobable) is at most

$$\left( 1 - \frac{b(d)}{n^c\, b(\delta)} \right)^N < e^{-N b(d)/(n^c b(\delta))}.$$

For $N = n^{c+1} b(\delta)/b(d)$, the exponent of the right-hand side of the last inequality is $-n$, and the probability that $x$
is not covered is at most $e^{-n}$. This probability remains exponentially small even after multiplying by $|S| \le 2^n$, the
number of different $x$'s in $S$. Hence, with probability at least $1 - (2/e)^n$ we have that $N$ random balls of the given
type cover all the strings in $S$. Therefore, there exists a deterministic selection of $N$ such balls that covers all the
strings in $S$. The lemma is proved. (A more accurate calculation shows that the lemma holds with $\alpha_n = O(n^4)$.)

■
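
The counting step in the proof of Claim 1 can be verified by brute force for small parameters. The Python sketch below assumes $n = 10$, $fn = 4$, and $dn = 5$ (so that $fdn = 2$ and $(1 - f)dn = 3$ are integers); these values and the function name are chosen only for illustration. With $y$ the all-zero string, it counts the strings at distance exactly $dn$ from $z$ that have exactly $\Delta n$ ones, and compares the count with the product of binomial coefficients.

```python
from math import comb


def count_on_sphere_at_distance(n, fn, dn, weight):
    """Count u in {0,1}^n at Hamming distance exactly dn from z with exactly `weight` ones.

    Here z consists of fn ones followed by n - fn zeros, encoded as an integer,
    and y is the all-zero string, so the weight of u is its distance from y.
    """
    z = (1 << fn) - 1
    return sum(
        1
        for u in range(1 << n)
        if bin(u ^ z).count("1") == dn and bin(u).count("1") == weight
    )


n, fn, dn = 10, 4, 5
flipped_ones, flipped_zeros = 2, 3            # f*d*n and (1-f)*d*n for f = 0.4, d = 0.5
weight = fn - flipped_ones + flipped_zeros    # number of ones in u, i.e. Delta * n
brute = count_on_sphere_at_distance(n, fn, dn, weight)
formula = comb(fn, flipped_ones) * comb(n - fn, flipped_zeros)
print(brute, formula)
```

Both counts equal 120 in this instance. The ball of radius $d$ around $z$ also contains strings at distance less than $d$, so the product of binomials is a lower bound on the number of sphere points the ball covers, as the proof requires.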

COROLLARY 1: Since all strings of length $n$ are either in the Hamming ball $B(00\ldots0, \frac{1}{2})$ or in the Hamming
ball $B(11\ldots1, \frac{1}{2})$ in $\mathcal{H}_n$, the lemma implies that the set $\{0, 1\}^n$ can be covered by at most

$$\frac{2 \alpha_n\, b(\frac{1}{2})}{b(d)}$$

balls of radius $d$ for every $0 \le d \le \frac{1}{2}$. (A similar, but direct, calculation lets us replace the factor $2\alpha_n$ by $n$.)

F. Proofs of the Theorems

Proof: of Theorem 1. (i) Lemma 1 (assuming properties 1 through 4) implies that the canonical structure function $g_x$ of every string $x$ of length $n$ is close to some function in the family $G_n$. This can be seen as follows. Fix $x$ and construct $g$ inductively for $n, n-1, \ldots, 0$. Define $g(n) = 0$ and

$$g(l-1) = \begin{cases} g(l) + 1 & \text{if } g(l) < g_x(l-1), \\ g(l) & \text{otherwise.} \end{cases}$$

By construction this function belongs to the family $G_n$. Let us show that $g_x(l) = g(l) + O(\log n)$. First, we prove that

$$g(l) \le g_x(l) \tag{11}$$

by induction on $l = n, n-1, \ldots, 0$. For $l = n$ the inequality is straightforward, since by definition $g(n) = 0$. Let $0 < l \le n$. Assume that $g(i) \le g_x(i)$ for $i = n, n-1, \ldots, l$. If $g(l) < g_x(l-1)$ then $g(l-1) = g(l) + 1$ and therefore $g(l-1) \le g_x(l-1)$. If $g(l) \ge g_x(l-1)$ then $g(l-1) = g(l) \ge g_x(l-1) \ge g_x(l) \ge g(l)$ and hence $g(l-1) = g_x(l-1)$.

Second, we prove that

$$g_x(l) \le g(l) + O(\log n)$$

for every $l = 0, 1, \ldots, n$. Fix an $l$ and consider the least $m$ with $l \le m \le n$ such that $g_x(m) = g(m)$. If there is no such $m$ we take $m = n$ and observe that $g_x(n) = O(\log n) = g(n) + O(\log n)$. This way, $g_x(m) = g(m) + O(\log n)$ and for every $l < l' \le m$ we have $g(l'-1) < g_x(l'-1)$ due to inequality (11) and the definition of $m$. Then $g_x(l'-1) > g(l'-1) \ge g(l')$, since we know that $g$ is nonincreasing. Then, by the definition of $g$ we have $g(l'-1) = g(l') + 1$. Thus we have $g(l) = g(m) + m - l$. Hence, $g_x(l) \le g_x(m) + m - l + O(\log n) = g(m) + m - l + O(\log n) = g(l) + O(\log n)$, where the inequality follows from Lemma 1, the first equality from the assumption that $g_x(m) = g(m) + O(\log n)$, and the second equality from the previous sentence.
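The inductive construction of $g$ from $g_x$ in part (i) can be sketched in a few lines. This is a minimal illustration, assuming the values $g_x(0), \ldots, g_x(n)$ are handed over as a nonincreasing list of nonnegative integers (in reality $g_x$ is not computable); the function name `approximate_by_family` is hypothetical.

```python
def approximate_by_family(gx):
    """Build g with g(n) = 0 and, going down from l = n to l = 1,
    g(l-1) = g(l) + 1 if g(l) < gx(l-1), and g(l-1) = g(l) otherwise.

    gx is an assumed list [gx(0), ..., gx(n)], nonincreasing, gx(n) >= 0.
    """
    n = len(gx) - 1
    g = [0] * (n + 1)  # g[n] = 0 by definition
    for l in range(n, 0, -1):
        if g[l] < gx[l - 1]:
            g[l - 1] = g[l] + 1  # climb by one while g stays below gx
        else:
            g[l - 1] = g[l]  # stay flat once g has caught up with gx
    return g
```

By construction every step of the result is 0 or 1, and the induction proving (11) shows that the result never exceeds the input pointwise.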
(ii) In Theorem IV.4 in [22] we proved a similar statement for the special distortion family $\mathcal{L}$ with an error term of $O(\log n)$. However, for the special case $\mathcal{L}$ we can let $x$ be equal to the first $x$ satisfying the inequality $g_x(l) \ge g(l) - O(\log n)$ for every $l$. In the general case this does not work any more. Here we construct $x$ together with sets ensuring the inequalities $g_x(l) \le g(l) + O(\sqrt{n \log n})$ for every $l = 0, \ldots, n$.

The construction is as follows. Divide the segment $\{0, 1, \ldots, n\}$ into $N = \sqrt{n/\log n}$ subsegments of length $\sqrt{n \log n}$ each. Let $l_0 = n > l_1 > \cdots > l_N = 0$ denote the end points of the resulting subsegments.

To find the desired $x$, we run the nonhalting algorithm below that takes $n$ and $\mathcal{A}_n$ as input together with the values of the function $g$ in the points $l_0, \ldots, l_N$. Let $\delta(n)$ be a computable integer valued function of $n$ of the order $\sqrt{n \log n}$ that will be specified later.

Definition 12: Let $i = 0, 1, \ldots, N$. A set $F \in \mathcal{A}_n$ is called $i$-forbidden if $|F| \le 2^{l_i}$ and $C(F) < g(l_i) - \delta(n)$. A set is called forbidden if it is $i$-forbidden for some $i = 0, 1, \ldots, N$.

We wish to find an $x$ that is outside all forbidden sets (since this guarantees that $g_x(l_i) \ge g(l_i) - \delta(n)$ for every $i$). Since $C(\cdot)$ is upper semicomputable, moreover property 3 holds, and we are also given $n$ and $g(l_0), \ldots, g(l_N)$, we are able to find all forbidden sets using the following subroutine.

Subroutine$(n, \mathcal{A}_n, g(l_0), g(l_1), \ldots, g(l_N))$:

for every $F \in \mathcal{A}_n$ upper semicompute $C(F)$; every time we find $C(F) < g(l_i) - \delta(n)$ and $|F| \le 2^{l_i}$ for some $i$ and $F$, then print $F$. End of Subroutine
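The subroutine can be illustrated with a toy model. Kolmogorov complexity is only upper semicomputable, so the sketch below does not compute $C(F)$; it assumes that some external process supplies a stream of successively improving upper bounds as pairs `(F, bound)`. That stream, the function name `enumerate_forbidden`, and the dictionary `g_at` holding the values $g(l_i)$ are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Iterator, List, Tuple


def enumerate_forbidden(
    approximations: Iterable[Tuple[FrozenSet[str], int]],
    endpoints: List[int],
    g_at: Dict[int, int],
    delta: int,
) -> Iterator[FrozenSet[str]]:
    """Yield each set F once, as soon as it is seen to be forbidden.

    approximations: assumed stream of (F, upper bound on C(F)) events.
    endpoints: the points l_0 > l_1 > ... > l_N.
    g_at: the values g(l_i) at those points.
    delta: the slack delta(n).
    """
    printed = set()
    for candidate, bound in approximations:
        if candidate in printed:
            continue  # every forbidden set is printed only once
        for l_i in endpoints:
            # i-forbidden: |F| <= 2^(l_i) and C(F) < g(l_i) - delta(n).
            if len(candidate) <= 2 ** l_i and bound < g_at[l_i] - delta:
                printed.add(candidate)
                yield candidate
                break
```

Like the subroutine, the generator never signals that the last forbidden set has been produced; a consumer only learns about new forbidden sets as they arrive.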

This subroutine prints all the forbidden sets in some order. Let $F_1, \ldots, F_T$ be that order. Unfortunately we do not know when the subroutine will print the last forbidden set. In other words, we do not know the number $T$ of forbidden sets. To overcome this problem, the algorithm will run the subroutine and every time a new forbidden set $F_t$ is printed, the algorithm will construct candidate sets $B_0(t), \ldots, B_N(t) \in \mathcal{A}_n$ satisfying $|B_i(t)| \le 2^{l_i}$ and $C(B_i(t)) \le g(l_i) + \delta(n)$ and the following condition

$$\bigcap_{j=0}^{N} B_j(t) \setminus \bigcup_{j=1}^{t} F_j \ne \emptyset \tag{12}$$

for every $t = 0, \ldots, T$. For $t = T$ the set $\bigcup_{j=1}^{T} F_j$ is the union of all forbidden sets, which guarantees the bounds $g(l_i) - \delta(n) \le g_x(l_i) \le g(l_i) + \delta(n)$ for all $x$ in the set in the left hand side of (12). Then we will prove that these bounds imply that $g(l) - \delta(n) \le g_x(l) \le g(l) + \delta(n)$ for every $l = 0, \ldots, n$. Each time a new forbidden set appears (that is, for every $t = 1, \ldots, T$) we will need to update candidate sets so that (12) remains true. To do that we will maintain a stronger condition than just non-emptiness of the left hand side of (12). Namely, we will maintain the following invariant: for every $i = 0, 1, \ldots, N$,

$$\left| \bigcap_{j=0}^{i} B_j(t) \setminus \bigcup_{j=1}^{t} F_j \right| \ge 2^{l_i - i - 1} \alpha_n^{-i}. \tag{13}$$

Note that for $i = N$ inequality (13) implies (12).

Algorithm$(n, \mathcal{A}_n, g(l_0), g(l_1), \ldots, g(l_N))$:

Initialize. Recall that $l_0 = n$. Define the set $B_0(t) = \{0,1\}^n$ for every $t$. This set is in $\mathcal{A}_n$ by property 1.
for $i := 1, \ldots, N$ do

Assume inductively that $|B_0(0) \cap B_1(0) \cap \cdots \cap B_{i-1}(0)| \ge 2^{l_{i-1}} \alpha_n^{-i+1}$, where $\alpha_n$ denotes a polynomial upper bound of the covering coefficient of distortion family $\mathcal{A}_n$ existing by property 4. (The value $\alpha_n$ can be computed from $n$.) Note that this inequality is satisfied for $i = 1$. Construct $B_i(0)$ by covering $B_{i-1}(0)$ by at most $\alpha_n 2^{l_{i-1} - l_i}$ sets of cardinality at most $2^{l_i}$ (this cover exists in $\mathcal{A}_n$ by property 4). Trivially, this cover also covers $B_0(0) \cap \cdots \cap B_{i-1}(0)$. The intersection of at least one of the covering sets with $B_0(0) \cap \cdots \cap B_{i-1}(0)$ has cardinality at least

$$\frac{2^{l_{i-1}} \alpha_n^{-i+1}}{\alpha_n 2^{l_{i-1} - l_i}} = 2^{l_i} \alpha_n^{-i}.$$

Let $B_i(0)$ be the first such covering set in a given standard order. od

Notice that after the Initialization the invariant (13) is true for $t = 0$, as $\bigcup_{j=1}^{t} F_j = \emptyset$. For every $t = 1, 2, \ldots$ perform the following steps 1 and 2 maintaining the invariant (13):

Step 1. Run the subroutine and wait until the $t$th forbidden set $F_t$ is printed (if $t > T$ the algorithm waits forever and never proceeds to Step 2).
Step 2. 

Case 1. For every $i = 0, 1, \ldots, N$ we have

$$\left| \bigcap_{j=0}^{i} B_j(t-1) \setminus \bigcup_{j=1}^{t} F_j \right| \ge 2^{l_i - i - 1} \alpha_n^{-i}. \tag{14}$$

Note that this inequality has one more forbidden set compared to the invariant (13) for $t-1$ (the argument in $B_j(t-1)$), and thus may be false. If it is true for every $i$, then we let $B_i(t) = B_i(t-1)$ for every $i = 1, \ldots, N$ (this setting maintains invariant (13)).

Case 2. Assume that (14) is false for some index $i$. In this case find the least such index (we will use later that (14) is true for all $i' < i$).

We claim that $i > 0$. That is, the inequality (14) is true for $i = 0$. In other words, the cardinality of $F_1 \cup \cdots \cup F_t$ is not larger than half of the cardinality of $B_0(t-1) = \{0,1\}^n$. Indeed, for every fixed $i$ the total cardinality of all the sets of simultaneously cardinality at most $2^{l_i}$ and Kolmogorov complexity less than $g(l_i) - \delta(n)$ does not exceed $2^{g(l_i) - \delta(n)} 2^{l_i}$. Therefore, the total number of elements in $\bigcup_{j=1}^{t} F_j$ is at most

$$\sum_{i=0}^{N} 2^{g(l_i) + l_i - \delta(n)} \le (N+1) 2^{g(n) + n - \delta(n)} = (N+1) 2^{n - \delta(n)} \le 2^{n-1} = \tfrac12 |\{0,1\}^n|,$$

where the first inequality follows since the function $g(l) + l$ is monotonic nondecreasing, the first equality since $g(n) = 0$ by definition, and the last inequality since we will set $\delta(n)$ at order of magnitude $\sqrt{n \log n}$.

First let $B_k(t) = B_k(t-1)$ for all $k < i$ (this maintains invariant (13) for all $k < i$). To define $B_i(t)$ find a covering of $B_{i-1}(t)$ by at most $\alpha_n 2^{l_{i-1} - l_i}$ sets in $\mathcal{A}_n$ of cardinality at most $2^{l_i}$. Since (14) is true for index $i - 1$, we have

$$\left| \bigcap_{j=0}^{i-1} B_j(t) \setminus \bigcup_{j=1}^{t} F_j \right| \ge 2^{l_{i-1} - i} \alpha_n^{-i+1}. \tag{15}$$

Thus the greatest cardinality of an intersection of the set in (15) with a covering set is at least

$$\frac{2^{l_{i-1} - i} \alpha_n^{-i+1}}{\alpha_n 2^{l_{i-1} - l_i}} = 2^{l_i - i} \alpha_n^{-i}.$$

Let $B_i(t)$ be the first such covering set in standard order. Note that $2^{l_i - i} \alpha_n^{-i}$ is at least twice the threshold required by invariant (13). Use the same procedure to obtain successively $B_{i+1}(t), \ldots, B_N(t)$.

End of Algorithm 

Although the algorithm does not halt, at some unknown time the last forbidden set $F_T$ is enumerated. After this time the candidate sets are not changed anymore. The invariant (13) with $i = N$ shows that the cardinality of the set in the left hand side of (12) is positive; hence the set is not empty.

Next we show that $C(B_i(t)) \le g(l_i) + \delta(n)$ for every $i$ and every $t = 1, \ldots, T$. We will see that to this end it suffices to upper bound the number of changes of each candidate set.

Definition 13: Let $m_i$ be the number of changes of $B_i$, defined by $m_i = |\{t : B_i(t) \ne B_i(t-1),\ 1 \le t \le T\}|$ for $0 \le i \le N$.

Claim 2: $m_i \le 2^{g(l_i) + i}$ for $0 \le i \le N$.

Proof: The Claim is proved by induction on $i$. For $i = 0$ the claim is true, since $l_0 = n$ and $g(n) = 0$ while $m_0 = 0$ by initialization in the Algorithm ($B_0(t)$ never changes).

($i > 0$): assume that the Claim is satisfied for every $j$ with $0 \le j < i$. We will prove that $m_i \le 2^{g(l_i) + i}$ by counting separately the number of changes of $B_i$ of different types.

Change of type 1. The set $B_i$ is changed when (14) is false for an index strictly less than $i$. The number of these changes is at most

$$m_{i-1} \le 2^{g(l_{i-1}) + i - 1} \le 2^{g(l_i) + i - 1},$$

where the first inequality follows from the inductive assumption, and the second inequality by the property of $g$ that it is nonincreasing. Namely, since $l_{i-1} > l_i$ we have $g(l_{i-1}) \le g(l_i)$.

Change of type 2. The inequality (13) is false for $i$ and is true for all smaller indexes.

Change of type 2a. After the last change of $B_i$ at least one $j$-forbidden set for some $j < i$ has been enumerated. The number of changes of this type is at most the number of $j$-forbidden sets for $j = 0, \ldots, i - 1$. For every such $j$ these forbidden sets have by definition Kolmogorov complexity less than $g(l_j) - \delta(n)$. Since $l_j > l_i$ and $g$ is monotonic nonincreasing we have $g(l_j) \le g(l_i)$. Because there are at most $N$ of these $j$'s, the number of such forbidden sets is at most

$$N 2^{g(l_i) - \delta(n)} \ll 2^{g(l_i)},$$

since we will later choose $\delta(n)$ of order $\sqrt{n \log n}$.

Change of type 2b. Finally, for every change of this type, between the last change of $B_i$ and the current one no candidate sets with indexes less than $i$ have been changed and no $j$-forbidden sets with $j < i$ have been enumerated. Since after the last change of $B_i$ the cardinality of the set in the left-hand side of (13) was at least $2^{l_i - i} \alpha_n^{-i}$, which is twice the threshold in the right-hand side by the restoration of the invariant in the Algorithm Step 2, Case 2, the following must hold. The cardinality of $\bigcup_{j=1}^{t} F_j$ increased by at least $2^{l_i - i - 1} \alpha_n^{-i}$ since the last change of $B_i$, and this must be due to enumerating $j$-forbidden sets for $j = i, \ldots, N$. For every such $j$ every $j$-forbidden set has cardinality at most $2^{l_j}$ and Kolmogorov complexity less than $g(l_j) - \delta(n)$. Hence the total number of elements in all $j$-forbidden sets is less than $2^{l_j} 2^{g(l_j) - \delta(n)}$. Since $j \ge i$ and hence $l_j \le l_i$ while $g(l) + l$ is monotonic nondecreasing we have $g(l_j) + l_j \le g(l_i) + l_i$. Because there are at most $N + 1$ of these $j$'s, the total number of elements in all those sets does not exceed $M = (N+1) 2^{g(l_i) - \delta(n) + l_i}$. The number of changes of this type is not more than the total number $M$ of elements involved divided by the increments of size $2^{l_i - i - 1} \alpha_n^{-i}$. Hence it is not more than

$$(N+1) 2^{g(l_i) - \delta(n) + i + 1} \alpha_n^{i}.$$

Let

$$\delta(n) > \log\bigl((N+1) 2^{i+10} \alpha_n^{i}\bigr) \quad \text{and} \tag{16}$$

$$\delta(n) = O(N \log(2\alpha_n)) = O\bigl(\sqrt{n/\log n}\, \log(2\alpha_n)\bigr) = O(\sqrt{n \log n}),$$

where the last equality uses that $\alpha_n$ is polynomial in $n$ by property 4. Then, the number of changes of type 2b is much less than $2^{g(l_i)}$. The value of $\delta(n)$ can be computed from $n$.

Summing the numbers of changes of types 1, 2a, and 2b we obtain $m_i \le 2^{g(l_i) + i}$, completing the induction. ■

Claim 3: Every $x$ in the nonempty set (12) satisfies $|g_x(l_i) - g(l_i)| \le \delta(n)$ with $\delta(n) = O(\sqrt{n \log n})$ for $i = 0, 1, \ldots, N$.

Proof: By construction $x$ is not an element of any forbidden set in $\bigcup_{t=1}^{T} F_t$, and therefore

$$g_x(l_i) \ge g(l_i) - \delta(n)$$

for every $i = 0, 1, \ldots, N$. By construction $|B_i(T)| \le 2^{l_i}$, and to finish the proof it remains to show that $C(B_i(T)) \le g(l_i) + \delta(n)$ so that $g_x(l_i) \le g(l_i) + \delta(n)$, for $i = 0, 1, \ldots, N$. Fix $i$. The set $B_i(T)$ can be described by a constant length program, that is $O(1)$ bits, that runs the Algorithm and uses the following information:

• A description of $i$ in $\log N \le \log n$ bits.

• A description of the distortion family $\mathcal{A}_n$ in $O(\log n)$ bits by property 3.

• The values of $g$ in the points $l_0, \ldots, l_N$ in $N \log n = \sqrt{n \log n}$ bits.

• The description of $n$ in $O(\log n)$ bits.

• The total number $m_i$ of changes (Case 2 in the Algorithm) to intermediate versions of $B_i$ in $\log m_i$ bits.

We count the number of bits in the description of $B_i(T)$. The description is effective and by Claim 2 with $i \le N = \sqrt{n/\log n}$ it takes at most $g(l_i) + O(\sqrt{n \log n})$ bits. So this is an upper bound on the Kolmogorov complexity $C(B_i(T))$. Therefore, for some $\delta(n)$ satisfying (16) we have

$$g_x(l_i) \le g(l_i) + \delta(n),$$

for every $i = 0, 1, \ldots, N$. The claim follows from the first and the last displayed equation in the proof. ■
Let us show that the statement of Claim 3 holds not only for the subsequence of values $l_0, l_1, \ldots, l_N$ but for every $l = 0, 1, \ldots, n$.

Let $l_i \le l \le l_{i-1}$. Both functions $g(l), g_x(l)$ are nonincreasing so that

$$g(l) \in [g(l_{i-1}), g(l_i)],$$

$$g_x(l) \in [g(l_{i-1}) - O(\sqrt{n \log n}),\ g(l_i) + O(\sqrt{n \log n})].$$

By the spacing of the sequence of $l_i$'s the length of the segment $[g(l_{i-1}), g(l_i)]$ is at most

$$g(l_i) - g(l_{i-1}) \le l_{i-1} - l_i = \sqrt{n \log n}.$$

If there is an $x$ such that Claim 3 holds for every $l_i$ with $i = 0, \ldots, N$, then it follows from the above that $|g(l) - g_x(l)| \le \sqrt{n \log n} + O(\sqrt{n \log n})$ for every $l = 0, 1, \ldots, n$. ■

Proof: of Theorem 2. We start with Lemma 6 stating a combinatorial fact that is interesting in its own right, as explained further in Remark 8.

Lemma 6: Let $n, m, k$ be natural numbers and $x$ a string of length $n$. Let $\mathcal{B}$ be a family of subsets of $\{0,1\}^n$ and $\mathcal{B}(x) = \{B \in \mathcal{B} : x \in B\}$. If $\mathcal{B}(x)$ has at least $2^m$ elements (that is, sets) of Kolmogorov complexity less than $k$, then there is an element in $\mathcal{B}(x)$ of Kolmogorov complexity at most $k - m + O(C(\mathcal{B}) + \log n + \log k + \log m)$.

Proof: Consider a game between Alice and Bob. They alternate moves starting with Alice's move. A move of Alice consists in producing a subset of $\{0,1\}^n$. A move of Bob consists in marking some sets previously produced by Alice (the number of marked sets can be 0). Bob wins if after every one of his moves every $x \in \{0,1\}^n$ that is covered by at least $2^m$ of Alice's sets belongs to a marked set. The length of a play is decided by Alice. She may stop the game after any of Bob's moves. However the total number of her moves (and hence Bob's moves) must be less than $2^k$. (It is easy to see that without loss of generality we may assume that Alice makes exactly $2^k - 1$ moves.) Bob can easily win if he marks every set produced by Alice. However, we want to minimize the total number of marked sets.

Claim 4: Bob has a winning strategy that marks at most $O(2^{k-m} k^2 n)$ sets.

Proof: We present an explicit strategy for Bob, which consists in executing at every move $t = 1, 2, \ldots, 2^k - 1$ the following algorithm for the sequence $A_1, A_2, \ldots, A_t$ which has been produced by Alice until then.

Step 1. Let $2^j$ be the largest power of 2 dividing $t$. Consider the last $2^j$ sets in the sequence $A_1, A_2, \ldots, A_t$ and call them $D_1, \ldots, D_{2^j}$.

Step 2. Let $T$ be the set of $x$'s that occur in at least $2^m/k$ of the sets $D_1, \ldots, D_{2^j}$. Let $D_p$ be a set such that $|D_p \cap T|$ is maximal. Mark $D_p$ (if there is more than one then choose the one with $p$ least) and remove all elements of $D_p \cap T$ from $T$. Call the resulting set $T_1$. Let $D_q$ be a set such that $|D_q \cap T_1|$ is maximal (if there is more than one then choose the one with $q$ least) and mark $D_q$. After removing all elements of $D_q \cap T_1$ from $T_1$ we obtain a set $T_2$. Repeat the argument until we obtain $T_{e_j} = \emptyset$.
Firstly, for the $j$ above we have $e_j \le \lceil 2^{j-m} k n \ln 2 \rceil$. This is proved as follows. We have

$$\sum_{i=1}^{2^j} |D_i \cap T| \ge |T| 2^m / k,$$

since every $x \in T$ is counted at least $2^m/k$ times in the sum in the left hand side. Thus there is a set in the list $D_1, \ldots, D_{2^j}$ such that the cardinality of its intersection with $T$ is at least $2^{-j}$ times the right hand side. By the choice of $D_p$ it is such a set and we have $|D_p \cap T| \ge |T| 2^{m-j}/k$.

The set $T$ has lost at least a $(2^{m-j}/k)$th fraction of its elements, that is, $|T_1| \le |T|(1 - 2^{m-j}/k)$. Since $T_1 \subseteq T$, obviously every element of $T_1$ (still) occurs in at least $2^m/k$ of the sets $D_1, \ldots, D_{2^j}$. Thus we can repeat the argument and mark a set $D_q$ with $|D_q \cap T_1| \ge |T_1| 2^{m-j}/k$. After removing all elements of $D_q \cap T_1$ from $T_1$ we obtain a set $T_2$ that is at most a $(1 - 2^{m-j}/k)$th fraction of $T_1$, that is, $|T_2| \le |T_1|(1 - 2^{m-j}/k)$.

Recall that we repeat the procedure $e_j$ times where $e_j$ is the number of repetitions until $T_{e_j} = \emptyset$. It follows that $e_j \le \lceil 2^{j-m} k n \ln 2 \rceil$ since

$$|T|(1 - 2^{m-j}/k)^{2^{j-m} k n \ln 2} < |T| e^{-n \ln 2} = |T| 2^{-n} \le 1.$$

Secondly, for every fixed $j = 0, 1, \ldots, k - 1$ there are at most $2^{k-j}$ different $t$'s ($t = 1, 2, \ldots, 2^k - 1$) divisible by $2^j$ and the number $d_j = 2^{k-j} e_j$ of marked sets we need to use for this $j$ satisfies $d_j \le 2^{k-j} 2^{j-m} k n \ln 2 = 2^{k-m} k n \ln 2$. For all $j = 0, \ldots, k - 1$ together we use a total number of marked sets of at most

$$\sum_{j=0}^{k-1} d_j \le 2^{k-m} k^2 n \ln 2.$$

In this way, after every move $t = 1, 2, \ldots, 2^k - 1$ of Bob, every $x$ occurring in $2^m$ of Alice's sets belongs to a marked set of Bob. This can be seen as follows. Assume to the contrary that there is an $x$ that occurs in $2^m$ of Alice's sets following move $t$ of Bob, and $x$ belongs to no set marked by Bob in step $t$ or earlier. Let $t = 2^{j_1} + 2^{j_2} + \cdots$ with $j_1 > j_2 > \cdots$ be the binary expansion of $t$. By Bob's strategy, the element $x$ occurs less than $2^m/k$ times in the first segment of $2^{j_1}$ sets of Alice, less than $2^m/k$ times in the next segment of $2^{j_2}$ of Alice's sets, and so on. Thus its total number of occurrences among the $t$ first sets of Alice is strictly less than $k 2^m/k = 2^m$. The contradiction proves the claim. ■
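Bob's strategy from the proof of Claim 4 can be simulated directly on small instances. The following is a minimal sketch, assuming Alice's moves are given in advance as a list of Python sets; the names `bob_marks`, `bob_wins`, and `random_play` are hypothetical helpers introduced only for this illustration.

```python
import random
from collections import Counter


def bob_marks(alice_sets, m, k):
    """Run Bob's strategy; return the marked indices after each move."""
    assert len(alice_sets) < 2 ** k  # Alice makes fewer than 2^k moves
    marked, history = set(), []
    for t in range(1, len(alice_sets) + 1):
        size = t & -t  # 2^j, the largest power of 2 dividing t
        block = range(t - size, t)  # 0-based indices of the last 2^j sets
        counts = Counter(x for p in block for x in alice_sets[p])
        # T: elements occurring in at least 2^m / k sets of the block.
        T = {x for x, c in counts.items() if c * k >= 2 ** m}
        while T:
            # Greedy step: largest intersection with T, least index on ties.
            p = max(block, key=lambda q: (len(alice_sets[q] & T), -q))
            marked.add(p)
            T -= alice_sets[p]
        history.append(set(marked))
    return history


def bob_wins(alice_sets, m, k):
    """Check the winning condition after every one of Bob's moves."""
    for t, marked in enumerate(bob_marks(alice_sets, m, k), start=1):
        seen = Counter(x for A in alice_sets[:t] for x in A)
        covered = set().union(*(alice_sets[p] for p in marked))
        if any(c >= 2 ** m and x not in covered for x, c in seen.items()):
            return False
    return True


def random_play(seed, moves, universe, size):
    """A hypothetical Alice: random subsets of range(universe)."""
    rng = random.Random(seed)
    return [set(rng.sample(range(universe), size)) for _ in range(moves)]
```

Calling `bob_wins(random_play(0, 15, 8, 3), m=2, k=4)` exercises the invariant on a 15-move play. The check mirrors the binary-expansion argument above: any element left unmarked occurs fewer than $2^m/k$ times in each of at most $k$ segments.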
Let us finish the proof of Lemma 6. Given the list of $\mathcal{B}$, recursively enumerate the sets in $\mathcal{B}$ of Kolmogorov complexity less than $k$, say $B_1, B_2, \ldots, B_T$ with $T < 2^k$, and consider this list as a particular sequence of moves by Alice. Use Bob's strategy of Claim 4 against Alice's sequence as above. Note that recursive enumeration of the sets in $\mathcal{B}$ of Kolmogorov complexity less than $k$ means that eventually all such sets will be produced, although we do not know when the last one is produced. This only means that the time between moves is unknown, but the alternating moves between Alice and Bob are deterministic and sequential. According to Claim 4, Bob's strategy marks at most $O(2^{k-m} k^2 n)$ sets. These marked sets cover every string occurring at least $2^m$ times in the sets $B_1, B_2, \ldots, B_T$. We do not know when the last set $B_T$ appears in this list, but Bob's winning strategy of Claim 4 ensures that immediately after recursively enumerating $B_i$ ($i \le T$) in the list every string that occurs in $2^m$ sets in the initial segment $B_1, B_2, \ldots, B_i$ is covered by a marked set. The Kolmogorov complexity $C(B_i)$ of every marked set $B_i$ in the list $B_1, B_2, \ldots, B_T$ is upper bounded by the logarithm of the number of marked sets, that is $k - m + O(\log k + \log n)$, plus the description of $\mathcal{B}$, $k$, $m$, and $n$ including separators in $O(C(\mathcal{B}) + \log k + \log m + \log n)$ bits. ■
We continue the proof of the theorem. Let the distortion family $\mathcal{A}$ satisfy properties 2 and 3. Consider the subfamily $\mathcal{B}$ of $\mathcal{A}_n$ consisting of all sets $A$ with $\lceil \log |A| \rceil = \lceil \log |B| \rceil$. Let $\mathcal{B}(x)$ be the family $\{B \in \mathcal{B} : x \in B\}$ and $N$ the number of sets in $\mathcal{B}(x)$ of Kolmogorov complexity at most $C(B)$.

Given $x$, $\lceil \log |B| \rceil$, $\mathcal{A}_n$ and $C(B)$ we can generate all $A \in \mathcal{B}(x)$ of Kolmogorov complexity at most $C(B)$. Then we can describe $B$ by its index among the generated sets. This shows that the description length $C(B \mid x) \le \log N$ (ignoring an additive term of order $O(\log C(B) + \log n)$, which suffices since $C(\lceil \log |B| \rceil)$ and $C(\mathcal{A}_n)$ are both $O(\log n)$).

Since $C(\mathcal{A}_n) = O(\log n)$ by property 3, $B \in \mathcal{A}_n$ while every set $A \in \mathcal{B}$ satisfies $\lceil \log |A| \rceil = \lceil \log |B| \rceil \le n$, we have $C(\mathcal{B}) = O(\log n)$. Let $k = C(B) + 1$ and $m = \lfloor \log N \rfloor$, and ignore additive terms of order $O(\log k + \log m + \log n)$. Applying Lemma 6 shows that there is a set $A \in \mathcal{B}(x)$ with $C(A) \le k - m \le C(B) - C(B \mid x) = I(x : B)$ and therefore proves Theorem 2. ■

REMARK 8: Previously an analog of Lemma 6 was known in the case when $\mathcal{B}$ is the class of all subsets of $\{0,1\}^n$ of fixed cardinality $2^l$. For $l = 0$ this is Exercise 4.3.8 (second edition) and 4.3.9 (third edition) of [9]: If a string $x$ has at least $2^m$ descriptions of length at most $k$ ($p$ is called a description of $x$ if $U(p) = x$ where $U$ is the reference Turing machine), then $C(x) \le k - m + O(\log k + \log m)$. Reference [22] generalizes this to all $l > 0$: If a string $x$ belongs to at least $2^m$ sets $B$ of cardinality $2^l$ and Kolmogorov complexity $C(B) \le k$, then $x$ belongs to a set $A$ of cardinality $2^l$ and Kolmogorov complexity $C(A) \le k - m + O(\log m + \log k + \log l)$.

REMARK 9: Probabilistic proof of Claim 4. Consider a new game that has the same rules and one additional rule: Bob loses if he marks more than $2^{k-m+1}(n+1) \ln 2$ sets. We will prove that in this game Bob has a winning strategy.

Assume the contrary: Bob has no winning strategy. Since the number of moves in the game is finite (less than $2^k$), this implies that Alice has a winning strategy.

Fix a winning strategy $S$ of Alice. To obtain a contradiction we design a randomized strategy for Bob that beats Alice's strategy $S$ with positive probability. Bob's strategy is very simple: mark every set produced by Alice with probability $p = 2^{-m}(n+1) \ln 2$.

Claim 5: (i) With probability more than $\frac12$, following every move of Bob every element occurring in at least $2^m$ of Alice's sets is covered by a marked set of Bob.

(ii) With probability more than $\frac12$, Bob marks at most $2^{k-m+1}(n+1) \ln 2$ sets.

Proof: (i) Fix $x$ and estimate the probability that there is a move of Bob following which $x$ belongs to $2^m$ of Alice's sets but belongs to no marked set of Bob.

Let $R_i$ be the event "following a move of Bob, string $x$ occurs in at least $i$ sets of Alice but none of them is marked". Let us prove by induction that

$$\Pr[R_i] \le (1 - p)^i.$$
For $i = 0$ the statement is trivial. To prove the induction step we need to show that $\Pr[R_{i+1} \mid R_i] \le 1 - p$.

Let $z = z_1, z_2, \ldots, z_t$ be a sequence of decisions by Bob: $z_j = 1$ if Bob marks the $j$th set produced by Alice and $z_j = 0$ otherwise. Call $z$ bad if following Bob's $t$th move it happens for the first time that $x$ belongs to $i$ sets produced by Alice by move $t$ but none of them is marked. Then $R_i$ is the disjoint union of the events "Bob has made the decisions $z$" (denoted by $Q_z$) over all bad $z$. Thus it is enough to prove that

$$\Pr[R_{i+1} \mid Q_z] \le 1 - p.$$

Given that Bob has made the decisions $z$, the event $R_{i+1}$ means that after those decisions the strategy $S$ will at some time in the future produce the $(i+1)$st set with member $x$ but Bob will not mark it. Bob's decision not to mark that set does not depend on any previous decision and is made with probability $1 - p$. Hence

$$\Pr[R_{i+1} \mid Q_z] = \Pr[\text{Alice produces the } (i+1)\text{st set with member } x \mid Q_z] \cdot (1 - p) \le 1 - p.$$

The induction step is proved. Therefore, $\Pr[R_{2^m}] \le (1 - p)^{2^m} < e^{-p 2^m} = 2^{-n-1}$, where the last equality follows by choice of $p$. Summing this bound over the at most $2^n$ strings $x$ shows that some $x$ is left uncovered with probability less than $\frac12$.

(ii) The expected number of marked sets is $p 2^k$. Thus the probability that it exceeds $p 2^{k+1}$ is less than $\frac12$. ■

It follows from Claim 5 that there exists a strategy by Bob that marks at most $2^{k-m+1}(n+1) \ln 2$ sets out of Alice's produced $2^k$ sets, and following every move of Bob every element occurring in at least $2^m$ of Alice's sets is covered by a marked set of Bob. Note that we have proved that this strategy of Bob exists but we have not constructed it. Given $n$, $k$ and $m$, the number of games is finite, and a winning strategy for Bob can be found by brute force search.
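The choice $p = 2^{-m}(n+1)\ln 2$ in Remark 9 can be sanity-checked numerically. The sketch below is only an illustration; the function name `uncovered_bound` is hypothetical, and the explicit check that $p$ is a valid probability is an assumption added here (it requires $2^m \ge (n+1)\ln 2$).

```python
import math


def uncovered_bound(n: int, m: int):
    """Return (p, per_string, union) for the random marking strategy.

    per_string = (1 - p)^(2^m) bounds Pr[R_{2^m}] for one fixed x;
    union multiplies it by 2^n, the number of strings of length n.
    """
    p = 2.0 ** (-m) * (n + 1) * math.log(2)
    if not 0 < p <= 1:
        # Assumed guard: p must be a probability for the strategy to exist.
        raise ValueError("need 2^m >= (n + 1) ln 2")
    per_string = (1.0 - p) ** (2 ** m)
    union = 2.0 ** n * per_string
    return p, per_string, union
```

For $n = 10$ and $m = 5$ the per-string bound falls below $2^{-11}$ and the union bound below $\frac12$, matching part (i) of Claim 5.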

Proof: of Theorem 3. Let $B \subseteq \{0,1\}^n$ be a set containing string $x$. Define the sufficiency deficiency of $x$ in $B$ by

$$\log |B| + C(B) - C(x).$$

This is the number of extra bits incurred by the two-part code for $x$ using $B$ compared to the optimal one-part code of $x$ using $C(x)$ bits. We relate this quantity with the randomness deficiency $\delta(x \mid B) = \log |B| - C(x \mid B)$ of $x$ in the set $B$. The randomness deficiency is always less than the sufficiency deficiency, and the difference between them is equal to $C(B \mid x)$:

$$\log |B| + C(B) - C(x) - \delta(x \mid B) = C(B \mid x), \tag{17}$$

where the equality follows from the symmetry of information (10), ignoring here and later in the proof additive terms of order $O(\log C(B) + \log n)$.

By Theorem 2, which assumes that properties 2 and 3 hold for the distortion family $\mathcal{A}$, there is $A \in \mathcal{A}(x)$ with $\lceil \log |A| \rceil = \lceil \log |B| \rceil$ and $C(A) \le C(B) - C(B \mid x)$. Since $A_x$ is a set of minimal Kolmogorov complexity among such $A$ we have $C(A_x) \le C(B) - C(B \mid x)$. Therefore

$$C(A_x) + \log |A_x| - C(x) \le C(B) - C(B \mid x) + \log |A_x| - C(x) = C(B) - C(B \mid x) + \log |B| - C(x) = \delta(x \mid B),$$

where the last equality is true by (17). ■
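For readers who want identity (17) spelled out, here is a short worked derivation. It assumes that the symmetry of information (10) is used in the standard form $C(B) + C(x \mid B) = C(x) + C(B \mid x)$, valid up to the logarithmic additive terms that the proof ignores.

```latex
\begin{align*}
\log|B| + C(B) - C(x) - \delta(x \mid B)
  &= \log|B| + C(B) - C(x) - \bigl(\log|B| - C(x \mid B)\bigr) \\
  &= C(B) + C(x \mid B) - C(x) \\
  &= C(x) + C(B \mid x) - C(x) \\
  &= C(B \mid x).
\end{align*}
```

The first line substitutes the definition of the randomness deficiency, and the third line is the only place where symmetry of information enters.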

Proof: of Theorem 4.

Left inequality. Given $\delta$, $n$, $p$, and the (discrete) graph of $r_n$, we can compute an optimal $E$ as in (8) such that $r_n(\delta) = \log |E(\mathcal{X}^n)|$. Retrieve $E(x)$ by its index of $r_n(\delta)$ bits in the set $E(\mathcal{X}^n)$. Then,

$$C(E(x)) \le r_n(\delta) + O(C(\delta, r_n, \mathcal{X}, n)).$$

By definition, $r_x(\delta) \le C(E(x))$. Taking the expectation of $r_x(\delta)$ over $p$, we are done.

Right inequality. Define a code $E_0$ such that $C(E_0(x)) = r_x(\delta)$ for every $x \in \mathcal{X}^n$. Let $E_0(\mathcal{X}^n)$ be the range of $E_0$. Although $E_0(\mathcal{X}^n)$ cannot be computed, it is finite, and trivially

$$\log |E_0(\mathcal{X}^n)| \le \max_{x \in \mathcal{X}^n} C(E_0(x)).$$

By definition $r_n(\delta) \le \log |E_0(\mathcal{X}^n)|$, which yields $r_n(\delta) \le \max_{x \in \mathcal{X}^n} r_x(\delta)$.

The noiseless coding theorem, [17], [9], shows that

$$\sum_{x \in \mathcal{X}^n} p(x) r_x(\delta) = \sum_{y \in E_0(\mathcal{X}^n)} S(y) C(y) \ge H(S),$$

with $S$ the distribution defined in the statement of the theorem. By definition, $r_n(\delta) \le \log |\mathcal{Y}^n|$, which yields $r_n(\delta) \le H(L)$, with $L$ as in the statement of the theorem. Together, we obtain $r_n(\delta) \le \mathbf{E}\, r_x(\delta) + \Delta_2$. ■